1 EXECUTIVE SUMMARY

The  current  process  for  planning  of  missions  in  Joint  Air  Operations  (JAO)  is  slow  and 
manually  intensive.  This  research,  performed  by  ALPHATECH,  focused  on  the  problem  of 
providing  military  commanders  with  real-time,  optimal  control  of  military  air-to-ground 
operations  through  the  use  of  fast,  near  optimal  mission  replanning,  using  control  algorithms  that 
anticipate  possible  mission  modifications  due  to  uncertain  future  events  for  a  24-hour  segment  of 
a  JAO  campaign.  The  primary  benefit  of  this  technology  is  agile  and  stable  control  of 
distributed,  dynamic  military  operations  conducted  in  inherently  uncertain,  hostile,  and  rapidly 
changing  environments.  The  goal  of  the  JAO  controller  is  to  achieve  specified  Joint  Force  Air 
Component  Commander  (JFACC)  objectives  while  minimizing  the  friendly  asset  losses.  The 
controller  generates  or  updates  mission  definitions  for  both  base  assets  and  airborne  assets.  The 
mission  definition  includes  a  target,  mission  waypoints,  strike  package  composition,  weapon 
composition,  and  desired  time-on-target.  Key  features  of  the  JAO  environment  include  risk  and 
reward that are dependent on package composition. The outcomes of JAO events (target
destruction, threat destruction, friendly asset attrition, emerging targets and threats) are
uncertain,  and  thus  missions  must  be  adapted  based  on  the  observed  outcomes.  Due  to  the 
potential  scarcity  of  resources,  efficient  resource  utilization  and  adaptation  to  emerging 
battlespace  conditions  are  paramount  to  assure  a  successful  mission. 

To develop controllers that address the uncertain, dynamic nature of the JAO
problem, we adopted a framework for dynamic decision making under uncertainty, known as
Markov  Decision  problems.  In  this  framework,  optimal  decisions  are  chosen  based  on  the  most 
recent  information,  and  the  selected  decisions  must  hedge  against  possible  future  contingencies. 
This  explicit  modeling  of  future  contingencies  results  in  proactive  versus  reactive  control 
behaviors.  This  proactive  attribute  is  desirable  for  stable  and  agile  control  of  the  JAO  enterprise 
because  future  information  arrival  and  control  opportunities  are  dependent  on  stringent  spatial, 
temporal,  and  coordination  constraints. 

The  principal  approach  for  control  design  using  Markov  Decision  problems  is  the 
Stochastic  Dynamic  Programming  (SDP)  algorithm.  However,  it  is  well  known  that  this 
approach  suffers  from  the  curse  of  dimensionality  and  is  intractable  (lacks  scalability)  for 
realistically sized problems. Thus, our investigations focused on developing Approximate Dynamic
Programming (ADP) strategies that provide the desirable proactive control behaviors but can be
computed in near real-time for realistic JAO scenarios. The control design technology is based
on  combining  hybrid  state  modeling  techniques  for  developing  statistical  dynamical  models 
relating  mission  decisions  to  evolution  of  objects  in  the  battlespace,  together  with  ADP  control 
design  techniques  that  have  demonstrated  real-time,  proactive  performance  for  other  relevant 
military  problems.  In  our  investigations,  we  developed  a  broad  spectrum  of  ADP  control 
techniques  for  the  JAO  problem,  and  evaluated  their  relative  performance  and  scalability.  The 
technical  accomplishments  of  this  research  are  summarized  below: 

•  Translated  the  JAO  Control  Enterprise  into  a  Dynamical  Hybrid  State,  Discrete  Event, 
Stochastic  Decision  Making  Problem 

•  Integrated  Emerging  ADP  Technologies  into  JAO  Feedback  Controllers 

•  Experimentally  Demonstrated  the  Benefits  of  Feedback  Control 

•  Experimentally  Demonstrated  Benefits  of  Approximate  Optimal  Control 

•  Developed  Innovative  Hybrid,  Multi-Rate  Control  Architecture 

•  Developed  Computationally  Efficient  Control  Algorithms  that  Produce  Operationally 
Consistent  Behaviors 

•  Extended  Control  Algorithms  to  Accommodate  Hierarchical  Mission  Tasking  and  ISR 
Information  Collection 

The  control  algorithms  developed  in  these  investigations  achieve  the  desired  research 
objective  of  automating  military  operations  planning  to  provide  real-time,  near-optimal  control 
strategies  that  achieve  operational  objectives  while  minimizing  asset  losses.  Adopting  a  hybrid, 
multi-rate  control  architecture  permitted  the  tailored  application  of  control  to  the  operational 
situation  at  hand;  faster  control  was  used  when  actions  had  relatively  local  effects  (e.g.  a  mission 
detour  or  divert),  followed  by  slower  modifications  to  address  overall  mission  strategy.  Given 
this architecture, a spectrum of ADP control strategies was developed and implemented that
produce decisions ranging from immediate retasking or aborts to preplanned multiple-wave tasking. The
solution quality and computational performance of these algorithms were tested and verified in a
JAO  simulator.  It  was  shown  through  experimentation  that  the  ADP  strategies  were  able  to 
produce  operationally  consistent,  proactive  control  strategies  that  anticipated  likely  contingencies 
and  positioned  assets  for  opportunities  of  recourse  all  in  either  real-time  or  near  real-time. 
Furthermore,  a  scalability  assessment  illustrated  that  many  of  the  ADP  controllers  could  provide 
near real-time performance for scenarios with 250 targets with some modest parallel computation.

2 INTRODUCTION


2.1  BACKGROUND 

The current process for JAO planning and execution, based on the steps of Strategy
Development, GAT (Guidance, Apportionment, and Targeting), MAAP (Master Air Attack
Planning), and Air Tasking Order (ATO) Production, is illustrated in Figure 1. The product is
published every 24 hours. End-to-end development typically requires 72 hours.


The ATO Process is a Continuous, Overlapped 72-Hour Process.

The  JFACC  Strategy  team  develops  the 
overarching  Air  &  Space  Strategy  in  concert 
with  the  JFC’s  Operations  Plan.  They 
develop, refine, and disseminate the long-range
Theater Air & Space Operations Plan.
They  continually  assess  plan  execution 
against  their  overarching  strategy. 

The  JFACC  GAT  team  develops  1)  the 
Guidance  Letter  that  addresses  planning  and 
apportionment,  2)  the  Joint  Integrated 
Prioritized Target List (JIPTL) in accordance
with  prioritized  tasks  in  the  Air  &  Space  Ops 
Plan,  and  3)  the  Master  Air  Attack  Plan 
(MAAP)  that  contains  JFC  &  JFACC 
guidance,  support  plans,  component 
requests,  target  update  requests,  forces 
availability,  target  information,  aircraft 
allocation,  etc. 

The  JFACC  ATO  Production  team  prepares 
detailed  plans  that  provide  operational  and 
tactical  direction  to  wing  level  commanders 
including:  1)  the  detailed  Air  Tasking  Order 
(ATO),  2)  Special  Instructions  (SPINS),  and 
3)  the  Airspace  Coordination  Orders  (ACO). 


Figure 1 Key Steps in the Current JAO Process Limit Agility and Responsiveness

The heavily sequential nature of the current JAO process hinders the JFACC’s ability to
operate  within  the  decision  cycles  of  our  adversaries.  Moreover,  brute  force  attempts  to  adapt 
the  current  process  (e.g.,  diverting  resources  to  engage  time-critical  targets)  often  lead  to 
unstable  operations.  Given  the  realities  of  the  dynamic  problem  above,  the  current  process 
suffers  from  the  following  limitations. 

Lack of Agility: Today’s JAO planning tools tend to produce tightly woven plans.
Given the problem as stated by the user, planners recognize the dependencies between resources,
and  generate  solutions  that  maximize  effectiveness  by  pushing  resources  to  limits.  Alas, 
unanticipated  changes  to  the  plan  can  generate  significant  ripple  effects  due  to  strong 
dependencies.  These  dependencies  are  intrinsic  to  the  JAO  control  problem:  the  delivery  of  a 
missile to a target requires the coordination of multiple resources: 1) to locate and identify the
target,  2)  to  launch  the  weapon,  3)  to  provide  safe  ingress  and  egress,  4)  to  provide  sufficient  fuel 
to  airborne  assets  and  5)  to  assess  bomb  damage. 

Planners attempt to avoid ripple effects by 1) scheduling redundancy into the plan or,
when changes are needed, by 2) terminating execution of all portions of the plan related to the
breakpoint, or 3) ignoring the dependencies and allowing the other portions of the plan to continue.
Redundancy implies inefficient use of assets, termination of a portion of a plan implies ineffective
use of assets, and ignoring dependencies implies risk of instability, i.e., mission failure.

Ad  Hoc  Stability:  Current  systems  rely  on  human  operators  to  serve  as  a  stabilizing 
force  by  assessing  the  likely  impact  of  unanticipated  events.  Although  human  operators  are  very 
good  at  adapting  to  new  situations,  their  performance  degrades  dramatically  when  they  are 
overloaded  with  information  and  tasks.  Since  the  effectiveness  of  the  assessment  depends  vitally 
on  the  operator’s  insight  into  the  plan,  and  on  the  time  available  to  make  a  decision,  the 
drawback  to  this  approach  is  the  highly  subjective  and  non-comprehensive  nature  of  the 
operator’s  assessment.  In  addition,  this  task  often  becomes  a  pacing  task,  preventing  the 
approach  from  scaling  to  highly  dynamic  environments. 

Ineffective  Feedback:  Current  systems  are  ineffective  at  providing  feedback  to  guide 
the  use  of  JAO  resources.  Fortunately,  more  sensors  will  soon  supply  more  timely  information 
on  the  state  of  the  battlespace.  Given  sufficient  agility  in  the  attack  and  sensor  platforms,  the 
challenge  is  to  reengineer  the  JAO  process  to  incorporate  this  information  in  a  disciplined 
manner.  This  also  requires  the  JAO  planning  process  to  actively  guide  the  information  collection 
process  in  support  of  mission  execution  by  providing  timely  information  need  specifications. 

Ineffective  Use  of  Assets  and  Resources:  Current  systems  tend  to  build  redundancy 
into  the  plans  in  order  to  provide  a  degree  of  agility  (e.g.,  place  aircraft  on  “SCUD  CAP”  in  case 
a  time-critical  target  (TCT)  appears).  Newer  concepts  focus  on  diverting  ongoing  missions  to 
deal  with  significant  changes  in  the  situation.  Unfortunately,  this  agility  often  comes  at  the 
expense  of  major  disruptions  to  the  execution  of  the  remainder  of  the  plan.  For  example,  the 
diversion  of  an  electronic  countermeasures  mission  to  support  a  TCT  kill  mission  may  result  in 
the cancellation of several planned missions, again leading to ineffective use of our resources.

2.2 PROBLEM DESCRIPTION

Within  the  Joint  Air  Operations  Enterprise  model,  ALPHATECH’s  research  is  focused 
on generating real-time, optimal control strategies for a 24-hour segment of a JAO campaign. It
is  assumed  that  some  form  of  higher  level  decomposition  of  the  battle  space  has  been  performed, 
and  that  an  Area  of  Responsibility  (AOR)  has  been  defined  that  includes  approximately  100 
targets.  The  goal  of  the  controller  is  to  achieve  the  specified  JFACC  objective  for  the  AOR 
while  minimizing  the  friendly  asset  losses.  The  controller  will  generate/update  mission 
definitions for both base assets and airborne assets. The mission definition includes a target, high-level
waypoints,  strike  package  composition,  weapon  composition,  and  desired  time-on-target.  Inputs 
to  the  controller  include  a  known  target  list,  available  assets,  known  threats  list,  and  some 
indication of the likelihood of emerging targets and/or threats. Through the continuous gathering
of  Intelligence,  Surveillance,  and  Reconnaissance  (ISR)  data,  the  control  system  will  generate 
these  mission  definitions  on  a  time  scale  of  approximately  15  minutes.  The  updates  can  be  as 
benign  as  continue  current  plan  or  as  drastic  as  reroll  some  packages,  abort  others,  or  launch  new 
packages. 
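
As a minimal sketch of the controller output described above, a mission definition could be held in a simple record. The field names, types, and helper function below are illustrative assumptions and are not taken from the ALPHATECH implementation; only the list of fields and the approximate 15-minute update time scale come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


# Hypothetical record for one mission definition; all field names are assumptions.
@dataclass
class MissionDefinition:
    target_id: str                         # target assigned to the package
    waypoints: List[Tuple[float, float]]   # high-level (x, y) waypoints
    package: Dict[str, int]                # strike package composition by aircraft type
    weapons: Dict[str, int]                # weapon composition by munition type
    time_on_target_min: float              # desired time-on-target, minutes from start

    def aircraft_count(self) -> int:
        """Total number of aircraft in the package."""
        return sum(self.package.values())


def update_due(now_min: float, last_update_min: float,
               interval_min: float = 15.0) -> bool:
    """True when a new controller update is due on the assumed update interval."""
    return now_min - last_update_min >= interval_min
```

A record of this kind would be regenerated or left unchanged at each update, which covers both the benign case (continue the current plan) and the drastic one (abort or launch packages).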

Key  features  of  the  JAO  environment  of  interest  include  risk  and  reward  that  are 
dependent on package composition and weaponeering. The outcomes of JAO events (target
destruction, threat destruction, friendly asset attrition, emerging targets and threats) are
uncertain in realistic JAO environments. Finally, managing limited resources and dealing with emerging
targets and threats are paramount to the successful execution of the JFACC objective.

2.3  THEORETICAL  TECHNIQUE 

To  address  the  limitations  of  the  current  JAO  control  process,  ALPHATECH  will 
investigate the utility of a comprehensive new approach based upon modern control theory.
Because of the complexity of JAO and the time-critical nature of the decisions that must be made,
automated  decision  aids  must  support  JAO  operators.  Past  attempts  at  developing  such  decision 
aids  have  been  based  upon  planning  technology.  Unfortunately,  the  plans  produced  by  these 
decision  aids  are  often  rendered  obsolete  by  unforeseen  events.  This  difficulty  can  be  addressed 
by  periodic  replanning,  either  in  full  or  (more  commonly)  by  partial  modification  of  the  plan. 
However, planning technology provides little guidance on 1) how this replanning should be
conducted, or 2) how plans can be made robust to the need for replanning and repair.

In  contrast,  control  theory  uses  the  idea  of  continuous  feedback  to  minimize  the  impact  of 
uncertainty.  Uncertainty  is  explicitly  represented  in  both  dynamics  and  observation  models. 
Feedback  control  laws  select  current  decisions  as  a  function  of  the  current  (estimated)  state  of  the 
system.  These  control  laws  are  selected  to  optimize  a  quantitative  objective  criterion,  subject  to 
constraints  on  controls,  dynamics,  and  observability.  Control  laws  for  current  decisions 
explicitly  consider  the  fact  that  future  decisions  will  be  made  in  the  same  optimal  manner,  based 
upon  state  information  available  in  the  future  [B96].  Thus  control  theory  provides  an  extensive 
conceptual  foundation  for  continuous  JAO. 
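
The idea that current decisions account for future decisions being made optimally can be illustrated with finite-horizon stochastic dynamic programming. The sketch below is a generic backward-induction routine applied to a toy problem; the two-state model, its probabilities, and its rewards are invented for illustration and do not come from the JAO model or from [B96].

```python
from typing import Dict, List, Tuple

# model[s][a] is a list of (probability, next_state, reward) outcomes.
Model = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def backward_induction(model: Model, horizon: int):
    """Finite-horizon stochastic dynamic programming.

    Returns (values, policy): values[t][s] is the optimal expected
    reward-to-go from state s at stage t, and policy[t][s] is the
    maximizing action (None for states that have no actions).
    """
    values = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    for s in model:
        values[horizon][s] = 0.0  # no terminal reward in this sketch
    for t in range(horizon - 1, -1, -1):
        for s, actions in model.items():
            best_a, best_v = None, 0.0
            for a, outcomes in actions.items():
                # Expected immediate reward plus optimal reward-to-go.
                q = sum(p * (r + values[t + 1][s2]) for p, s2, r in outcomes)
                if best_a is None or q > best_v:
                    best_a, best_v = a, q
            values[t][s] = best_v
            policy[t][s] = best_a
    return values, policy


# Toy example (all numbers are assumptions): a target is "alive" or "dead".
# "strike" costs 1 and destroys the target (reward 10) with probability 0.6.
TOY: Model = {
    "alive": {
        "strike": [(0.6, "dead", 9.0), (0.4, "alive", -1.0)],
        "wait": [(1.0, "alive", 0.0)],
    },
    "dead": {},  # absorbing state with no actions
}

values, policy = backward_induction(TOY, horizon=2)
```

In this toy case the stage-0 value of striking includes the option to strike again at stage 1 if the first attempt fails, which is the recourse effect the text refers to. The state space of a realistic problem is far too large for this exact enumeration.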

2.4  VALUE  TO  MILITARY 

The  goal  of  this  research  is  to  provide  real-time  dynamic  control  of  military  air  operations 
via  near  optimal  mission  assignments,  which  are  robust  under  replanning.  The  primary  benefit 
of  this  technology  is  agile  and  stable  control  of  distributed  and  dynamic  military  operations 
conducted  in  inherently  uncertain,  hostile,  and  rapidly  changing  environments. 

Utilizing the available real-time information from the battle space, the proposed algorithm
generates mission assignments resembling an ATO to achieve a desired JFACC objective. Using
feedback to create or update mission assignments desensitizes the desired outcome to
modeling error and uncertainties that are inherently associated with military air operations.
Furthermore,  by  formulating  the  control  problem  as  an  optimal  control  problem,  the 
recommendations  will  anticipate  key  uncertainties  and  provide  opportunities  for  recourse.  In 
fact,  the  optimal  solution  will  achieve  the  desired  JFACC  objective  by  minimizing  the 
operational  cost  and  risk  to  our  assets.  As  an  example,  the  optimal  solution  can  identify  a  critical 
communication  linkage  that  if  destroyed  can  achieve  air  superiority  without  having  to  destroy 
every  component  of  an  Integrated  Air  Defense  System  (IADS).  Given  the  progression  towards 
more Unmanned Air Vehicles (UAVs), the proposed system could be used to automatically
provide the UAV with its mission. The benefits of this technology to the military are summarized
below using the taxonomy presented in the BAA:

Increased Agility: Approximate optimal control techniques generate solutions that
permit opportunities of recourse for key uncertain JAO outcomes, thus increasing the agility of
JAO operations by proactively (versus reactively), consistently, and efficiently responding to
changes in the environment.

Flexibility:  This  technology  is  applicable  to  a  wide  spectrum  of  military  conflicts. 

Stability: Feedback control provides an automated system to stabilize the JAO
environment, thus eliminating the need for ad hoc stabilization via human operators.

Effective Feedback: Feedback control provides a natural framework to fuse large
volumes of data to produce coherent control strategies.

Effective Use of Assets and Resources: Optimal control generates control solutions that
effectively use the available resources.

Automated  Operations:  Envisioned  is  a  prototype  system  that  fits  into  existing  C2 
AOC.  This  system  would  monitor  the  progression  of  the  battle  space  and  generate  real-time 
control  strategies.  The  system  could  automatically  manage  autonomous  assets. 

2.5  REPORT  LAYOUT 

This  report  has  three  primary  sections.  In  Section  3,  the  details  of  the  JAO  plant 
dynamics  are  presented.  Next,  Section  4  contains  a  detailed  discussion  of  the  JAO  controllers 
formulation  and  implementation.  Finally,  Section  5  summarizes  the  key  experimental  results  for 
the  dynamics  and  controllers  that  were  implemented.  In  addition  to  these  primary  sections,  a 
series of appendices are included to complement the main body of the report.

3 JAO PLANT DYNAMICS

In  this  section,  we  develop  the  JAO  plant  dynamics  in  which  air  package  composition 
and  tasking  are  represented  in  the  context  of  a  Joint  Force  Air  Component  Commander  (JFACC) 
air-to-ground problem. Our model provides a flexible representation of the JFACC problem that
“matches”  the  fidelity  of  our  control  approach  (and  has  evolved  accordingly),  allowing  for 
efficient  experimentation.  The  presentation  of  the  JAO  plant  dynamics  begins  with  a  general 
discussion  of  the  JAO  air-to-ground  problem.  This  discussion  is  followed  by  the  presentation  of 
the  dynamics  that  were  implemented  for  this  research.  Finally,  the  details  of  the  JAO  plant 
dynamics  implementation  will  be  presented. 

3.1 JAO AIR-TO-GROUND PROBLEM DESCRIPTION

In  this  section,  we  describe  the  general  JFACC  air-to-ground  problem  from  which  we 
will  distill  the  salient  JAO  plant  dynamics  in  subsequent  sections. 

The  objective  of  the  JFACC  strike  mission  planning  problem  is  to  maximize  damage  to 
enemy  targets  while  minimizing  losses  of  our  own  aircraft.  Strike  missions  involve  targets,  air 
defense  units,  strike  aircraft,  and  Suppression  of  Enemy  Air  Defense  (SEAD)  aircraft.  We 
discuss  the  roles  of  each  below. 

Targets  are  objects  the  JFACC  wishes  to  damage.  The  JFACC  objectives  may  specify  a 
desired  effect  such  as  destroying,  temporarily  disabling,  or  disrupting  the  target.  Targets  are 
vulnerable  to  strike  aircraft,  which  are  discussed  below.  Multiple  strike  aircraft  or  special  tactics 
may  be  required  to  achieve  the  desired  effect  when  targets  are  geographically  dispersed  (i.e., 
have  multiple  aim  points).  Targets  may  also  be  coupled  (physically  or  logically),  and  the  strike 
mission  may  need  to  achieve  some  threshold  damage  before  attaining  the  JFACC's  objective. 
Collateral  damage  is  always  a  concern,  and  application  of  strike  force  above  some  threshold  may 
lead  to  undesirable  outcomes.  Multiple  strike  missions  may  also  need  to  be  coordinated  in  time 
(i.e.,  sequenced  or  synchronized)  to  achieve  the  desired  effects. 

The  true  state  of  the  target  (including  location,  identity,  function,  and  value)  is  dynamic 
and  may  not  be  known  with  certainty.  Targets  may  move  or  hide,  and  their  functions  may 
change,  making  their  prosecution  more  or  less  desirable  over  time.  Strike  mission  planners  do 
not  necessarily  know  all  potential  targets  beforehand,  and  they  may  elect  to  keep  some  strike 
forces  in  reserve  to  attack  targets  that  were  heretofore  unknown.  These  dynamics  imply  that  the 
target's vulnerability and value vary with time. A time is typically specified for which a specific
strike mission is likely to achieve the desired effect. In general, strike missions will need to be
tailored  depending  on  when  and  where  the  target  will  be  prosecuted.  Adversaries  protect  targets 
using  a  combination  of  passive  and  active  defense  measures.  Passive  measures  make  targets 
harder  to  kill  and/or  prosecute;  they  include  hardening,  cover,  concealment,  and  deception 
(HCCD).  Active  defense  directly  attacks  aircraft  in  the  strike  mission.  We  discuss  Air  Defense 
units  next. 

Air  Defense  (AD)  units  defend  targets.  The  effectiveness  of  an  AD  unit  depends  on  the 
type of AD unit, the type of attacking aircraft, the tactics of both the aircraft and the AD unit, and a
variety of external influences such as weather. AD units are vulnerable to various types of
suppression, which may be available on the strike aircraft but are typically concentrated on
SEAD aircraft (discussed below). Suppression may be in the form of munitions that destroy
critical components of the AD unit. Although this is typically permanent, AD units may be
repaired given enough time. Other types of suppression, such as jamming, chaff, and decoys,
temporarily disable or deceive AD units, which makes them less effective. The AD unit's
effectiveness may be enhanced by strategic placement, such as protecting multiple targets or
overlapping to protect a single target. AD units may also be connected or coupled, which tends to
improve their effectiveness and reduce their vulnerability.

Similar  to  targets,  the  true  state  of  the  AD  units  is  dynamic  and  may  not  be  known  with 
certainty. The AD units may be known or unknown, may move or hide, may choose not to
expose themselves to suppression aircraft, or may operate at a lower effectiveness to reduce
vulnerability.

Strike  aircraft  prosecute  targets.  Prosecution  encompasses  a  range  of  effects,  from 
permanently destroying the target, to temporarily disabling it, to simply disrupting its operation.
However,  strike  aircraft  are  vulnerable  to  enemy  Air  Defense  (AD).  Therefore,  the  aircraft's 
effectiveness  is  judged  by  its  ability  to  circumvent  AD  and  achieve  the  desired  effect  on  the 
target.  To  accomplish  these  objectives,  the  aircraft  is  configured  with  munitions  and  the  ability  to 
suppress  enemy  air  defense. 

Each Strike aircraft is configured with a limited number of munitions, which are nonrenewable
and may only be replenished at an airbase (provided the supply exists). Selection of
the munitions, as well as the tactics used to deploy them, determine the effect on the target (i.e.,
destroy, disable, or disrupt) and depend on several external influences including terrain, type of
target, air defenses, potential for collateral damage, and weather. Much of the strike aircraft's
ability to suppress enemy air defense also comes from nonrenewable resources such as chaff,
flares, and decoys. Strike aircraft may also be equipped with self-protection jamming equipment;
this equipment is not depletable, but has power and duty cycle constraints. We discuss renewable
resources for suppressing enemy air defense further when we introduce Weasel aircraft below.
Another important nonrenewable resource is fuel. Fuel may be replenished at appropriate
facilities, including air bases and orbiting tanker stations.

SEAD aircraft help the strike aircraft through the enemy's air defenses. Some SEAD aircraft are
configured to physically attack air defense units; these are Weasel aircraft. Other SEAD aircraft
are configured to jam air defense units (primarily air defense radars); these are Jammers.

Weasel  aircraft  are  configured  with  munitions;  these  are  depletable  resources.  Jammers 
are equipped with an array of jamming pods; these are not depletable, but have power and
duty-cycle constraints.

SEAD  aircraft  are  packaged  with  strike  aircraft  to  support  a  particular  mission  or  placed 
on  CAP  to  support  a  group  of  missions  in  a  particular  area.  Jammer  aircraft  are  slower  than  strike 
and  weasel  aircraft,  but  their  effect  may  cover  a  wide  geographical  area.  Therefore,  these  aircraft 
are  ideal  for  supporting  multiple  strike  missions  in  a  particular  geographical  area. 

To achieve the JFACC objectives, air packages, i.e., teams of aircraft that coordinate their
activities and fly in either a loose or tight formation, must be composed, tasked, and retasked
such that the maximum number of targets are prosecuted with minimal resource losses. In
general, air package configuration specifies the number and type of aircraft and resources. Resources
include munitions type and quantity, SEAD capability (including self-protection pods), SEAD
configuration (e.g., frequency range), and extra fuel tanks. Air package tasking includes target
designation, course routing information, and potential opportunities for future retasking.
Retasking involves updating an air package tasking in flight, which is limited by a variety of
constraints but may be valuable against high-value, fleeting targets.

In  the  following,  we  develop  our  hybrid  dynamical  model  for  Joint  Air  Operations  (JAO). 
We  start  with  a  general  discussion  of  hybrid  dynamical  models.  Based  on  this  general  discussion, 
we  develop  hybrid  dynamical  models  for  each  of  the  objects  in  the  JAO  environment.  Finally, 
we  introduce  a  composite  hybrid  dynamical  model  that  facilitates  interaction  among  JAO 
objects. The hierarchical structure of the composite hybrid dynamical model simplifies design
and  analysis,  as  well  as  permitting  a  more  lucid  exposition  of  this  complex  environment. 

3.2  HYBRID  DYNAMICAL  MODEL  FORMULATION 

Consider the following hybrid dynamical system based on [Bran95]:

$H = [Q, \Sigma, A, G, V, C, F]$

where $Q$ is a discrete state space. Each state (or mode) $q \in Q$ is associated with a controlled
dynamical system $\Sigma_q \in \Sigma$, an autonomous jump set $A_q \in A$, and a controlled jump set $C_q \in C$.

The controlled dynamical system is given by $\Sigma_q = [X_q, \Gamma_q, \phi_q, U_q]$, where $X_q$ is the state space,
$\Gamma_q$ is an appropriately defined timing map, $\phi_q$ represents the dynamics, and $U_q$ is the continuous
control set. If the state of the dynamical system enters $A_q \subset X_q$, the system jumps autonomously
to some new mode $p \neq q$ according to a control mapping $G_q$, which is parameterized by the
control set $V_q$. Alternatively, if the state enters $C_q \subset X_q$, the system may be instructed to jump to
some new mode $p \neq q$ such that $X_p \subset F_q$.
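
To make the roles of the tuple elements concrete, the following is a minimal sketch of one simulation step of such a hybrid system. The class and function names are assumptions chosen for illustration, the timing map is reduced to a fixed step size, and the ordering (jumps are checked before the continuous flow) is a simplification rather than part of the formulation in [Bran95].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[float, ...]     # continuous state x in X_q
Hybrid = Tuple[str, State]    # hybrid state s = (q, x)


@dataclass
class Mode:
    flow: Callable[[State, float, float], State]         # phi_q: (x, u, dt) -> new x
    in_auto_set: Callable[[State], bool]                  # is x in A_q?
    auto_jump: Callable[[State, Optional[str]], Hybrid]   # G_q(x, v)
    in_ctrl_set: Callable[[State], bool]                  # is x in C_q?
    ctrl_dests: Callable[[State], List[Hybrid]]           # F_q(x): feasible destinations


def step(modes: Dict[str, Mode], s: Hybrid, u: float, dt: float,
         v: Optional[str] = None, jump_to: Optional[Hybrid] = None) -> Hybrid:
    """Advance the hybrid state by one step.

    An autonomous jump fires whenever x is in A_q. A controlled jump is
    taken only if requested, x is in C_q, and the destination is feasible.
    Otherwise the continuous dynamics of the current mode are applied.
    """
    q, x = s
    mode = modes[q]
    if mode.in_auto_set(x):
        return mode.auto_jump(x, v)
    if jump_to is not None and mode.in_ctrl_set(x):
        if jump_to not in mode.ctrl_dests(x):
            raise ValueError("requested jump is not feasible")
        return jump_to
    return (q, mode.flow(x, u, dt))


# Toy two-mode system (an assumption): "climb" until x >= 1, then "hold".
MODES = {
    "climb": Mode(
        flow=lambda x, u, dt: (x[0] + u * dt,),
        in_auto_set=lambda x: x[0] >= 1.0,
        auto_jump=lambda x, v: ("hold", (1.0,)),
        in_ctrl_set=lambda x: False,
        ctrl_dests=lambda x: [],
    ),
    "hold": Mode(
        flow=lambda x, u, dt: x,
        in_auto_set=lambda x: False,
        auto_jump=lambda x, v: ("hold", x),
        in_ctrl_set=lambda x: True,
        ctrl_dests=lambda x: [("climb", (0.0,))],
    ),
}
```

Starting from `("climb", (0.0,))` with `u=0.5` and `dt=1.0`, two steps of flow bring the state to the autonomous jump set and the third step switches the mode to "hold".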

3.2.1  Battlespace  Objects 

In this section, we develop the hybrid dynamical models for air packages, targets, and threats.

3.2.1.1 Air Packages

Air package dynamics are represented using the hybrid dynamical representation in the above
section. The discrete state is

$q \in \mathbb{Z}_+^2 \times \{\text{Ingress}, \text{Egress}, \text{Base}\}$

where $\mathbb{Z}_+^2$ specifies the number of Strike and Weasel aircraft and $\{\text{Ingress}, \text{Egress}, \text{Base}\}$
identifies the current operating condition of the air package: Ingress indicates that the air
package is fully loaded and may be retasked, Egress indicates that the air package has already
engaged its target and is not eligible for retask, and Base indicates that the air package’s mission
is completed and the associated aircraft are available for assembly into new air packages. The
continuous dynamics of an air package are independent of the discrete state and are given by

$\Sigma = [X, \Gamma, \phi, \mu(r)]$

where $X \subseteq \mathbb{R}^2$ contains the $(x, y)$ location of the air package, $\Gamma$ is an appropriately defined
timing map, $\phi$ is a constant velocity kinematic model, and $\mu(r): X \to U$ is a flight controller that
maps the current state into a continuous control input $u \in U$ given the next waypoint $r$.
Disturbances are neglected in the continuous dynamics. Waypoints may be introduced through
either autonomous jumps via the control set $V$ (e.g., the last waypoint has been reached, so pick up the
next waypoint or loiter) or through controlled jumps specified in $C$ (e.g., retask or abort mission).
We neglect disturbances with respect to waypoint selection, i.e., there is no navigational error.
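
The continuous dynamics just described can be sketched as a constant-speed point mass that flies straight toward its next waypoint. The function below is an illustration under stated assumptions: the name `fly`, the straight-line flight controller, and the rule that reaching a waypoint immediately picks up the next one (or loiters in place when none remain) are simplifications, not the implemented flight controller.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def fly(pos: Point, waypoints: List[Point], speed: float, dt: float,
        tol: float = 1e-9) -> Tuple[Point, List[Point]]:
    """Advance an air package by dt at constant speed toward waypoints[0].

    A reached waypoint is dropped, which stands in for the autonomous jump
    that picks up the next waypoint. With no waypoints left the package
    loiters in place. No disturbances or navigation errors are modeled.
    """
    remaining = speed * dt          # distance that can be flown this step
    px, py = pos
    route = list(waypoints)
    while route and remaining > tol:
        wx, wy = route[0]
        dist = math.hypot(wx - px, wy - py)
        if dist <= remaining:       # waypoint reached within this step
            px, py = wx, wy
            remaining -= dist
            route.pop(0)
        else:                       # fly part of the way along the current leg
            frac = remaining / dist
            px += frac * (wx - px)
            py += frac * (wy - py)
            remaining = 0.0
    return (px, py), route
```

A retask or abort, i.e., a controlled jump, would correspond to replacing the `waypoints` list between calls.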

The  autonomous  jump  set  A  is  similarly  defined  for  all  q,  including  waypoint  arrival, 
which  instigates  subsequent  actions.  Autonomous  jumps  also  represent  various  interactions  when 
considered  within  a  composite  model;  these  jumps  are  summarized  in  Figure  2  and  will  be 
discussed in detail in Section 3.2.2 below. From this figure we see that threat engagements may
diminish the aircraft in the air package, and a single engagement has the potential to destroy all
aircraft, while target engagements transition the air package from Ingress to Egress.


Figure  2  Autonomous  Jumps  Represent  Interactions  with  Air  Packages 
The controlled jump set $C \subset X$ enables changes in tasking and/or configuration of air packages. The mapping $F: C \to 2^S$, where $S = X \times Q$, specifies a set of feasible jumps for each state $x \in C$ (e.g., reconfiguration at the base). We obtain the familiar control-theoretic notation by recognizing that, in the absence of uncertainty, $F(x)$ is the set of possible control actions. In a stochastic setting, $F(x)$ is the set of all possible outcomes.

3.2.1.2 Observation Platforms

In  the  final  experiments,  which  are  run  with  partial  information,  observation  platforms 
are  used.  These  platforms  are  similar  to  the  air  package,  but  only  interact  with  other  objects 
through  detection.  Detection  provides  perfect  information  regarding  specific  objects  in  the  JAO 
environment.  We  typically  consider  a  global  information  model  in  which  information  is 
maintained  independent  of  individual  battlespace  objects.  However,  the  current  implementation 
allows  for  a  more  general  representation  that  accounts  for  distributed  information  and 
communication  in  which  information  models  similar  to  the  global  information  model  are 
associated  with  each  battlespace  object. 

3.2.1.3 Targets

Targets are also represented using the hybrid dynamical representation introduced in Section 3.2. In general, targets have the discrete state $q \in \{Known, Unknown, Dead\}$ and may be initialized as Known or Unknown. The continuous state, for our purposes, is a fixed location, $X = (x, y)$ for all $q$. Figure 3 illustrates the autonomous and controlled jumps associated with targets.


Figure  3  Autonomous  and  Controlled  Jumps  Associated  with  Targets 
Autonomous jumps (blue) are used only to represent interactions within a composite model, namely engagements with air packages. The controlled jumps (black) are characterized by the controlled jump sets, $C_q$, and the controlled jump transition maps, $F_q$. These controlled transitions are governed by a Poisson process, which may be considered a stochastic model of the adversary, allowing the targets to autonomously hide or emerge.

3.2.1.4 Threats

Using the hybrid dynamical representation, threats may be in one of five discrete states, $q \in \{Active, Inactive, Unknown, Repairable, Dead\}$, and may be initialized as Active, Inactive, or Unknown. As with targets, the continuous state is a fixed location, $X = (x, y)$. Figure 4 illustrates the autonomous and controlled jumps associated with threats.


Figure  4  Autonomous  and  Controlled  Jumps  Associated  with  Threats 
Autonomous jumps (red) are used only to represent interactions within a composite model, namely engagements with air packages. The controlled jumps (black and green) are characterized by the controlled jump sets, $C_q$, and the controlled jump transition maps, $F_q$. These transitions are governed by a Poisson process, which may be considered a stochastic model of the adversary, allowing the threats to autonomously activate, deactivate, hide, emerge, and repair. In addition, we consider the attack (i.e., SAM launch) transient state during engagement with an air package. We now characterize the autonomous behavior of a threat.

The state transitions that can occur between engagements correspond to Emerge, Hide, Activate, Deactivate, and Repair. Note that some transitions are only possible during an engagement (e.g., transition to REPAIRABLE or DEAD from any other state) while other transitions are only possible between engagements (e.g., from REPAIRABLE). The autonomous transition dynamics are modeled as a stochastic timed automaton, equivalently a continuous-time Markov model. Accordingly, a matrix $Q$ that specifies the rate at which state transitions occur is defined. These rates depend on the time spent in each state and the transition probability, and are represented concisely in the following matrix, in which each row is the current state and each column is the next state:

$$
Q = \begin{array}{l|ccccc}
 & \text{Active} & \text{Inactive} & \text{Repairable} & \text{Unknown} & \text{Dead} \\ \hline
\text{Active} & \bullet & \mu_{Inactive|Active} & 0 & \mu_{Unknown|Active} & 0 \\
\text{Inactive} & \mu_{Active|Inactive} & \bullet & 0 & \mu_{Unknown|Inactive} & 0 \\
\text{Repairable} & \mu_{Active|Repairable} & \mu_{Inactive|Repairable} & \bullet & \mu_{Unknown|Repairable} & 0 \\
\text{Unknown} & \mu_{Active|Unknown} & 0 & 0 & \bullet & 0 \\
\text{Dead} & 0 & 0 & 0 & 0 & 0
\end{array}
$$

where $\bullet$ is defined so that the rows sum to 0. $\mu_{Inactive|Active}$ is the rate at which an active threat deactivates (i.e., observable but not radiating). $\mu_{Unknown|Active}$ is the rate at which an active threat hides (i.e., becomes unobservable). $\mu_{Active|Inactive}$ is the rate at which an inactive threat activates (i.e., starts radiating). $\mu_{Unknown|Inactive}$ is the rate at which an inactive threat hides. $\mu_{Active|Unknown}$ is the rate at which an unknown threat activates. The transition rates from REPAIRABLE are functions of the repair rate $\gamma_{REPAIR}$:

$$
\begin{aligned}
\mu_{Active|Repairable} &= P_{ACTIVE|REPAIRED} \cdot \gamma_{REPAIR} \\
\mu_{Inactive|Repairable} &= P_{INACTIVE|REPAIRED} \cdot \gamma_{REPAIR} \\
\mu_{Unknown|Repairable} &= P_{UNKNOWN|REPAIRED} \cdot \gamma_{REPAIR}
\end{aligned}
$$

where the repair rate is weighted by the probability that the threat will immediately transition into the associated state. These probabilities sum to one,

$$P_{ACTIVE|REPAIRED} + P_{INACTIVE|REPAIRED} + P_{UNKNOWN|REPAIRED} = 1$$

indicating that a repaired threat will become ACTIVE, INACTIVE, or UNKNOWN. The transition probabilities are derived from the rate matrix given the time interval $\Delta t_{Inter}$ using the following matrix equation:

$$P_{Inter}(\Delta t_{Inter}) = e^{Q \Delta t_{Inter}}$$

where $Q$ is the rate matrix and $\Delta t_{Inter}$ is the time interval. For a given initial probability distribution over the state of the threat, denoted in vector form by $\sigma_I$, the distribution after $\Delta t_{Inter}$, denoted $\sigma_F$, is given by

$$\sigma_F^T = \sigma_I^T \, P_{Inter}(\Delta t_{Inter})$$

where $\Delta t_{Inter}$ is the time interval between $\sigma_I$ and $\sigma_F$. For large enough $\Delta t_{Inter}$, $\sigma_F$ will achieve a steady state distribution, which is useful to characterize initial information regarding the threat (i.e., a threat that has not been observed or interacted with in a long time).
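
As a rough illustration of how the transition probabilities over an interval could be computed from a rate matrix, the following Python sketch evaluates the matrix exponential with a truncated Taylor series combined with scaling and squaring. The rate values, the time unit, and the helper names (`mat_mul`, `expm`, `make_rate_matrix`) are assumptions chosen for illustration; they are not values or code from this report.

```python
# States ordered as in the rate matrix: Active, Inactive, Repairable, Unknown, Dead.
STATES = ["Active", "Inactive", "Repairable", "Unknown", "Dead"]

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(q, dt, squarings=10, terms=12):
    """Approximate exp(q * dt): Taylor series on a scaled matrix, then repeated squaring."""
    n = len(q)
    scale = dt / (2 ** squarings)
    small = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = mat_mul(term, small)                  # small^k / (k-1)!
        term = [[x / k for x in row] for row in term]  # small^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def make_rate_matrix(off_diag):
    """Fill the diagonal so that every row of the rate matrix sums to zero."""
    q = [row[:] for row in off_diag]
    for i, row in enumerate(q):
        row[i] = -sum(x for j, x in enumerate(row) if j != i)
    return q

# Hypothetical rates (per hour); rows are current states, columns are next states.
Q = make_rate_matrix([
    [0.0, 0.5, 0.0, 0.2, 0.0],   # Active
    [1.0, 0.0, 0.0, 0.3, 0.0],   # Inactive
    [0.1, 0.2, 0.0, 0.1, 0.0],   # Repairable: repair rate 0.4 split 25/50/25
    [0.4, 0.0, 0.0, 0.0, 0.0],   # Unknown
    [0.0, 0.0, 0.0, 0.0, 0.0],   # Dead: no transitions between engagements
])

P = expm(Q, dt=2.0)                     # transition probabilities over two hours
sigma_i = [0.0, 1.0, 0.0, 0.0, 0.0]     # threat known to be Inactive initially
sigma_f = [sum(sigma_i[i] * P[i][j] for i in range(5)) for j in range(5)]
```

Each row of `P` is a probability distribution over the next state, so `sigma_f` remains a proper distribution. A production implementation would more likely rely on a numerical library routine for the matrix exponential.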

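In the following, we derive the steady state distribution for a threat with the rate matrix $Q$ discussed above. The continuous Markov model is given by

$$\dot{\sigma}^T = \sigma^T Q$$

with the constraint that $\sigma$ is a proper distribution and the elements sum to one:

$$\sum_i \sigma_i = 1$$

Steady state is achieved when $\dot{\sigma} = 0$. To compute the steady state probabilities, we define $b = [0\ 0\ 0\ 0\ 0\ |\ 1]^T$ and $\tilde{Q} = [Q\ |\ \mathbf{1}]$, so that the steady state and normalization conditions combine into $\tilde{Q}^T \sigma = b$. The steady state distribution is then given by

$$\sigma = (\tilde{Q}\tilde{Q}^T)^{-1} \tilde{Q}\, b$$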
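
The steady state formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the rates are hypothetical, the helper `solve` is a plain Gaussian elimination written for this example, and the Dead state is left out. With a rate matrix in which Dead is neither entered nor left between engagements, the steady state is not unique and the normal equations become singular, so the sketch works with the four remaining states only.

```python
def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def steady_state(q):
    """Return sigma = (Qt Qt^T)^-1 Qt b, where Qt = [Q | 1] and b = [0 ... 0 | 1]^T."""
    n = len(q)
    q_tilde = [row[:] + [1.0] for row in q]              # n x (n + 1)
    b = [0.0] * n + [1.0]
    gram = [[sum(q_tilde[i][k] * q_tilde[j][k] for k in range(n + 1))
             for j in range(n)] for i in range(n)]       # Qt Qt^T
    rhs = [sum(q_tilde[i][k] * b[k] for k in range(n + 1)) for i in range(n)]  # Qt b
    return solve(gram, rhs)

# Hypothetical rates (per hour) over Active, Inactive, Repairable, Unknown.
Q4 = [
    [-0.7,  0.5,  0.0,  0.2],
    [ 1.0, -1.3,  0.0,  0.3],
    [ 0.1,  0.2, -0.4,  0.1],
    [ 0.4,  0.0,  0.0, -0.4],
]
sigma = steady_state(Q4)
```

For these rates the Repairable entry of `sigma` is zero, since nothing enters that state between engagements, and the remaining mass is shared among Active, Inactive, and Unknown.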


3.2.2  Battlespace  Dynamics 

When  the  individual  battlespace  objects  are  brought  together  into  a  setting  where 
interactions  are  possible,  we  form  a  composite  hybrid  dynamical  model  with  the  same  form  as 
the  individual  objects.  However,  the  composite  JAO  model  subsumes  the  individual  dynamical 
models  and  introduces  interaction  dynamics  that  depend  on  two  or  more  objects  within  the  JAO 
environment.  For  our  purpose,  interactions  involving  more  than  two  participants  are 
decomposed  into  a  sequence  of  pairwise  interactions. 

Consider  the  composite  hybrid  dynamical  model 

The discrete state space of the composite model is defined as the composition of the individual discrete state spaces, $Q_c = \prod_i Q_i$, where the subscript $i$ denotes the $i$th battlespace object. Each state (or mode) of the composite model, $q_c \in Q_c$, is associated with a composite dynamical system, $\Sigma_{q_c} \in \Sigma_c = \prod_i \Sigma_i$, a composite autonomous jump set $A_{q_c}$, and a composite controlled jump set $C_{q_c} = \prod_i C_i$. For our purposes, the composite dynamical system is a non-interacting parallel composition with the exception of $X_{q_c}$, which is the composite state space on which the autonomous jump sets $A_{q_c} \subset X_{q_c}$ are defined. Therefore, if the composite state of the dynamical system enters $A_{q_c}$ (an interaction), the system jumps autonomously to some new mode according to a control mapping $G_{q_c}$, which is parameterized by the control set $V_{q_c}$. These interactions affect one or more of the battlespace objects.

This  hierarchical  paradigm  simplifies  the  modeling  process  by  relating  directly  to  the 
physical  reality  (i.e.,  objects  and  interactions)  of  the  JAO  environment.  In  the  following  sections, 
we discuss the interactions between air packages and threats, denoted threat engagements, and the interactions between air packages and targets, denoted target engagements.


3.2.2.1 Detection Interaction

A detection interaction occurs when an observation platform $i$ is within sensor range of an observable object $j$:

$$\| x_i - x_j \| \le R_i$$

where $x_i \in X_{q_i}$ is the location of an observation platform, $x_j \in X_{q_j}$ is the location of an observable object, and $R_i$ is the sensor range associated with the observation platform $i$. The result of a detection interaction is a discrete change in the available information regarding the battlespace; it does not otherwise affect the objects. We assume perfect observations.
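
The range test that triggers a detection interaction can be sketched as follows. The data layout (dictionaries keyed by object name), the function names, and the coordinates are hypothetical and serve only to illustrate the condition described above.

```python
import math

def within_range(platform_xy, object_xy, sensor_range):
    """Return True when the object lies within the platform's sensor range."""
    dx = platform_xy[0] - object_xy[0]
    dy = platform_xy[1] - object_xy[1]
    return math.hypot(dx, dy) <= sensor_range

def detections(platforms, objects):
    """List the (platform, object) pairs for which a detection interaction occurs."""
    return [(pid, oid)
            for pid, (p_xy, rng) in platforms.items()
            for oid, o_xy in objects.items()
            if within_range(p_xy, o_xy, rng)]

# Hypothetical scenario: one observation platform with a 50 km sensor range.
platforms = {"Observer_1": ((0.0, 0.0), 50.0)}
objects = {"SAMSite_1": (30.0, 40.0), "Target_1": (60.0, 0.0)}
detected = detections(platforms, objects)
```

Because observations are assumed perfect, a detected pair would simply mark the object's true state as known in the information model.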

3.2.2.2 Target Engagement

A target engagement occurs when an air package $i$ arrives at the location of target $j$:

$$x_i = x_j$$

where $x_i \in X_{q_i}$ is the location of an air package and $x_j \in X_{q_j}$ is the location of a target. The result of a target engagement may only change the discrete state of the target, and does not change the continuous state of either object, i.e., $X_{q} = X_{q'}$ if $q$ is the discrete state prior to the engagement and $q'$ the discrete state after the engagement. The control set $V_{q_c}$ controls the transition based on the attributes of the interacting objects (i.e., aim points and salvo) and the random component that determines the outcome.

The current model of the target engagement was proposed as part of the enterprise modeling effort. This is a static, analytic model that provides the probability of destroying the target as a function of the number of Strike aircraft in the air package, the number of aim points associated with the target, and the strike salvo (i.e., the number of munitions). The probability that air package $i$ destroys target $j$ is given by:

$$P_{i,j} = \left(1 - \exp(-numStrike_i / aimPoints_j)\right) \cdot \left(1 - (1 - P_i)^{strikeSalvo}\right)$$

where $aimPoints_j$ is an attribute of the target representing the number of points that need to be hit on a particular target, thus influencing the effectiveness of the air package; $P_i$ is the effectiveness of the aircraft's munition; and $strikeSalvo$ is the number of munitions launched by the strike aircraft at the target. This simple static model is sufficient under the current modeling assumptions (i.e., targets do not defend themselves).
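
The static target engagement model above translates directly into a short function. The following Python sketch assumes the formula as reconstructed here; the function and argument names, and the sample values, are illustrative only.

```python
import math

def target_kill_probability(num_strike, aim_points, munition_effectiveness, strike_salvo):
    """Probability that an air package destroys a target under the static model.

    First factor: grows with the ratio of Strike aircraft to aim points.
    Second factor: probability that at least one munition of the salvo is effective.
    """
    package_factor = 1.0 - math.exp(-num_strike / aim_points)
    salvo_factor = 1.0 - (1.0 - munition_effectiveness) ** strike_salvo
    return package_factor * salvo_factor

# Hypothetical package: 2 Strike aircraft, 2 aim points, 80% effective munition, salvo of 2.
p_kill = target_kill_probability(2, 2, 0.8, 2)
```

Adding Strike aircraft or enlarging the salvo raises the probability with diminishing returns, which is what makes package composition a meaningful control variable.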


3.2.2.3 Threat Engagement

A threat engagement occurs when an air package $i$ is within range of threat $j$:

$$\| x_i - x_j \| \le R_j$$

where $x_i \in X_{q_i}$ is the location of an air package, $x_j \in X_{q_j}$ is the location of a threat, and $R_j$ is the range associated with threat $j$. The result of a threat engagement may change the discrete state of either or both participating objects, but does not change their continuous state, i.e., $X_{q} = X_{q'}$ if $q$ is the discrete state prior to the engagement and $q'$ the discrete state after the engagement. The control set $V_{q_c}$ controls the transition based on the duration of the engagement and the random component that determines the outcome.

The threat engagement is broken into three parts: pre-engagement, engagement, and post-engagement. Pre-engagement transitions account for emergence or activation of a threat with the intent of attacking. Engagement transitions account for the interactive dynamics of an air package with an active and attacking threat. Post-engagement transitions account for hiding or shoot-and-scoot behaviors. Finally, we combine these segments in a single set of transition dynamics and adopt a concise matrix representation.

The pre-engagement transition accounts for emergence with the intent of attacking. The probability that a threat will attack is given by the probability of attacking from any given state (except DEAD and REPAIRABLE) times the probability of being in that state. Note that, in order to attack, a threat in the REPAIRABLE state must have transitioned into another state prior to engagement; otherwise there is no guarantee that sufficient time has elapsed for the "repair" to take place. Therefore, the probability of attack is given by

$$P_{ATTACK} = P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN}$$

This formulation assumes that there is some form of sensing available (possibly visual) that cues threat activation for subsequent attack from the INACTIVE and UNKNOWN states. We also assume that an "attacking" threat is identifiable (e.g., by tracking radar), thus only threats that are attacking are engaged. Therefore, if the threat does not attack, there are no pre-engagement transitions.


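$$
\begin{aligned}
P^{Pre}_{ACTIVE} &= (1 - P_{ATTACK|ACTIVE}) \cdot P_{ACTIVE} \\
P^{Pre}_{INACTIVE} &= (1 - P_{ATTACK|INACTIVE}) \cdot P_{INACTIVE} \\
P^{Pre}_{REPAIRABLE} &= P_{REPAIRABLE} \\
P^{Pre}_{UNKNOWN} &= (1 - P_{ATTACK|UNKNOWN}) \cdot P_{UNKNOWN} \\
P^{Pre}_{DEAD} &= P_{DEAD}
\end{aligned}
$$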

The engagement transition accounts for uncertainty associated with the interaction. In general, engagements may be considered suppression or lethal. Suppression engagements disable the threat, thus transitioning it into the REPAIRABLE state, while lethal engagements destroy the threat, thus transitioning it into the DEAD state. We distinguish lethal engagements probabilistically with $P_{DEAD|SUCCESS}$, given a successful attack on the threat. This indicates that sufficient damage was done to transition the threat to DEAD. The threat must be ACTIVE after the pre-engagement transition to engage, and will transition to either ACTIVE, REPAIRABLE, or DEAD as a result of the engagement.


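$$
\begin{aligned}
P^{Engage}_{ACTIVE} &= P_{FAIL|ATTACK} \cdot P_{ATTACK} + P^{Pre}_{ACTIVE} \\
P^{Engage}_{REPAIRABLE} &= (1 - P_{DEAD|SUCCESS}) \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK} + P^{Pre}_{REPAIRABLE} \\
P^{Engage}_{DEAD} &= P_{DEAD|SUCCESS} \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK} + P^{Pre}_{DEAD}
\end{aligned}
$$

The engagement dynamics are characterized by $P_{FAIL|ATTACK}$ and $P_{SUCCESS|ATTACK} = 1 - P_{FAIL|ATTACK}$, which depend on the interacting air package. Threats that are initially INACTIVE, UNKNOWN, or DEAD do not take part in the engagement, so the INACTIVE and UNKNOWN probabilities carry over unchanged and the DEAD probability grows only by the lethal term above:

$$
P^{Engage}_{INACTIVE} = P^{Pre}_{INACTIVE}, \qquad P^{Engage}_{UNKNOWN} = P^{Pre}_{UNKNOWN}
$$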


Post-engagement  transition  accounts  for  deactivating/hiding  behaviors  after  engagement. 


$$
\begin{aligned}
P^{Post}_{ACTIVE} &= P^{Engage}_{ACTIVE} - (P_{DEACTIVATE} + P_{HIDE}) \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK} \\
P^{Post}_{INACTIVE} &= P_{DEACTIVATE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK} + P^{Engage}_{INACTIVE} \\
P^{Post}_{UNKNOWN} &= P_{HIDE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK} + P^{Engage}_{UNKNOWN}
\end{aligned}
$$

where $P_{DEACTIVATE} + P_{HIDE} \le 1$.

If the threat is UNKNOWN, REPAIRABLE, or DEAD, there is no post-engagement transition.

$$
P^{Post}_{REPAIRABLE} = P^{Engage}_{REPAIRABLE}, \qquad P^{Post}_{DEAD} = P^{Engage}_{DEAD}
$$


We combine pre-engagement, engagement, and post-engagement transition probabilities into a concise matrix representation. First, the transition probabilities are aggregated.

$$
\begin{aligned}
P^{Post}_{ACTIVE} &= (1 - P_{DEACTIVATE} - P_{HIDE}) \cdot P_{FAIL|ATTACK} \cdot \left( P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN} \right) + (1 - P_{ATTACK|ACTIVE}) \cdot P_{ACTIVE} \\
P^{Post}_{INACTIVE} &= P_{DEACTIVATE} \cdot P_{FAIL|ATTACK} \cdot \left( P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN} \right) + (1 - P_{ATTACK|INACTIVE}) \cdot P_{INACTIVE} \\
P^{Post}_{REPAIRABLE} &= (1 - P_{DEAD|SUCCESS}) \cdot P_{SUCCESS|ATTACK} \cdot \left( P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN} \right) + P_{REPAIRABLE} \\
P^{Post}_{UNKNOWN} &= P_{HIDE} \cdot P_{FAIL|ATTACK} \cdot \left( P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN} \right) + (1 - P_{ATTACK|UNKNOWN}) \cdot P_{UNKNOWN} \\
P^{Post}_{DEAD} &= P_{DEAD|SUCCESS} \cdot P_{SUCCESS|ATTACK} \cdot \left( P_{ATTACK|ACTIVE} \cdot P_{ACTIVE} + P_{ATTACK|INACTIVE} \cdot P_{INACTIVE} + P_{ATTACK|UNKNOWN} \cdot P_{UNKNOWN} \right) + P_{DEAD}
\end{aligned}
$$


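This set of equations may be concisely represented in matrix form as follows:

$$
\begin{bmatrix} P^{Post}_{ACTIVE} \\ P^{Post}_{INACTIVE} \\ P^{Post}_{REPAIRABLE} \\ P^{Post}_{UNKNOWN} \\ P^{Post}_{DEAD} \end{bmatrix}^T
=
\begin{bmatrix} P_{ACTIVE} \\ P_{INACTIVE} \\ P_{REPAIRABLE} \\ P_{UNKNOWN} \\ P_{DEAD} \end{bmatrix}^T
\begin{bmatrix}
P_{Active|Active} & P_{Inactive|Active} & P_{Repairable|Active} & P_{Unknown|Active} & P_{Dead|Active} \\
P_{Active|Inactive} & P_{Inactive|Inactive} & P_{Repairable|Inactive} & P_{Unknown|Inactive} & P_{Dead|Inactive} \\
0 & 0 & P_{Repairable|Repairable} & 0 & 0 \\
P_{Active|Unknown} & P_{Inactive|Unknown} & P_{Repairable|Unknown} & P_{Unknown|Unknown} & P_{Dead|Unknown} \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
$$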


where  the  individual  terms  are  defined  as  follows: 


$$
\begin{aligned}
P_{Active|Active} &= (1 - P_{DEACTIVATE} - P_{HIDE}) \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|ACTIVE} + (1 - P_{ATTACK|ACTIVE}) \\
P_{Inactive|Active} &= P_{DEACTIVATE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|ACTIVE} \\
P_{Repairable|Active} &= (1 - P_{DEAD|SUCCESS}) \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|ACTIVE} \\
P_{Unknown|Active} &= P_{HIDE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|ACTIVE} \\
P_{Dead|Active} &= P_{DEAD|SUCCESS} \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|ACTIVE} \\[4pt]
P_{Active|Inactive} &= (1 - P_{DEACTIVATE} - P_{HIDE}) \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|INACTIVE} \\
P_{Inactive|Inactive} &= P_{DEACTIVATE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|INACTIVE} + (1 - P_{ATTACK|INACTIVE}) \\
P_{Repairable|Inactive} &= (1 - P_{DEAD|SUCCESS}) \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|INACTIVE} \\
P_{Unknown|Inactive} &= P_{HIDE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|INACTIVE} \\
P_{Dead|Inactive} &= P_{DEAD|SUCCESS} \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|INACTIVE} \\[4pt]
P_{Repairable|Repairable} &= 1 \\[4pt]
P_{Active|Unknown} &= (1 - P_{DEACTIVATE} - P_{HIDE}) \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|UNKNOWN} \\
P_{Inactive|Unknown} &= P_{DEACTIVATE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|UNKNOWN} \\
P_{Repairable|Unknown} &= (1 - P_{DEAD|SUCCESS}) \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|UNKNOWN} \\
P_{Unknown|Unknown} &= P_{HIDE} \cdot P_{FAIL|ATTACK} \cdot P_{ATTACK|UNKNOWN} + (1 - P_{ATTACK|UNKNOWN}) \\
P_{Dead|Unknown} &= P_{DEAD|SUCCESS} \cdot P_{SUCCESS|ATTACK} \cdot P_{ATTACK|UNKNOWN}
\end{aligned}
$$

Alternately, $P_{Active|Active}$, $P_{Inactive|Inactive}$, $P_{Repairable|Repairable}$, and $P_{Unknown|Unknown}$ may be defined based on the other terms such that each row sums to 1. The parameters of this model are as follows:


•  $P_{DEACTIVATE}$ is the probability that an “attacking” threat will deactivate after the engagement;




•  $P_{HIDE}$ is the probability that an “attacking” threat will hide (become unknown) after the engagement;

•  $P_{FAIL|ATTACK}(\pi, \Delta T_{Engage})$ is the probability that the threat survived the engagement (i.e., suppression failed) and is a function of the engagement time, $\Delta T_{Engage}$, and the probabilistic state of the associated air package, $\pi$ (discussed further in the following paragraphs);

•  $P_{SUCCESS|ATTACK}(\pi, \Delta T_{Engage}) = 1 - P_{FAIL|ATTACK}(\pi, \Delta T_{Engage})$;

•  $P_{DEAD|SUCCESS}$ is the probability that a successful threat engagement will destroy the threat;

•  $P_{ATTACK|ACTIVE}$ is the probability that an active threat will attack;

•  $P_{ATTACK|INACTIVE}$ is the probability that an inactive threat will attack; and

•  $P_{ATTACK|UNKNOWN}$ is the probability that an unknown threat will attack.
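
To make the combined transition model concrete, the following Python sketch assembles the five-state engagement matrix from the parameters listed above and applies it to a prior distribution. The function names and all numeric values are hypothetical; in particular, $P_{FAIL|ATTACK}$ is passed in as a fixed number here rather than computed from the air package state and engagement time.

```python
STATES = ["Active", "Inactive", "Repairable", "Unknown", "Dead"]

def engagement_matrix(p_deactivate, p_hide, p_fail, p_dead_given_success, p_attack):
    """Build the engagement transition matrix (rows: prior state, columns: new state).

    p_attack maps "Active", "Inactive" and "Unknown" to the probability of attacking.
    """
    p_success = 1.0 - p_fail
    rows = []
    for state in STATES:
        if state == "Repairable":
            rows.append([0.0, 0.0, 1.0, 0.0, 0.0])   # cannot attack, stays Repairable
            continue
        if state == "Dead":
            rows.append([0.0, 0.0, 0.0, 0.0, 1.0])   # Dead is absorbing
            continue
        a = p_attack[state]
        row = [
            (1.0 - p_deactivate - p_hide) * p_fail * a,      # survives and stays Active
            p_deactivate * p_fail * a,                       # survives and deactivates
            (1.0 - p_dead_given_success) * p_success * a,    # suppressed
            p_hide * p_fail * a,                             # survives and hides
            p_dead_given_success * p_success * a,            # destroyed
        ]
        row[STATES.index(state)] += 1.0 - a   # a threat that does not attack is unchanged
        rows.append(row)
    return rows

def propagate(prior, matrix):
    """Row vector times matrix: distribution over threat states after the engagement."""
    n = len(prior)
    return [sum(prior[i] * matrix[i][j] for i in range(n)) for j in range(n)]

# Hypothetical parameter values.
M = engagement_matrix(0.2, 0.1, 0.4, 0.5,
                      {"Active": 0.9, "Inactive": 0.5, "Unknown": 0.3})
prior = [0.5, 0.2, 0.0, 0.3, 0.0]
posterior = propagate(prior, M)
```

Every row sums to one because $P_{FAIL|ATTACK} + P_{SUCCESS|ATTACK} = 1$, which gives a quick consistency check on any parameter set.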

Consider $P_{FAIL|ATTACK}(\pi, \Delta T_{Engage})$, which depends on the state of the associated air package $\pi$ and the engagement time $\Delta T_{Engage}$. This represents the actual engagement dynamics and is based on an aggregate model of the shoot-look-shoot behavior of both the air package and the threat. In this model, the uncertainty associated with the threat engagement depends on the composition of the air package, the state of the threat (as a result of previous missions), and the duration of the engagement (determined from the geometry). The simplified geometry of this interaction is shown in Figure 5. The circle represents the footprint of the SAM's lethal range. Any route through the lethal range of the threat will result in an exposure time that indicates the duration of the engagement, $\Delta T$. The state prior to the engagement is denoted $\pi_I$, and the final state is denoted $\pi_F$. The engagement dynamics are defined within a composite model of the interacting air package and threat.

Figure 5  SAM Engagement Model Geometry

The underlying dynamics of the threat engagement are represented by a set of events (e.g., SAM fires, Weasel fires) that are related by a continuous-time Markov chain, which may be solved analytically using

$$\pi_F = \pi_I \, e^{Q \Delta T}$$

where $\pi_I$ and $\pi_F$ are the states before and after the engagement and $Q$ is a transition rate matrix describing the engagement dynamics. $Q$ is constructed from a set of parameters that describe the effectiveness of the SAM against each of the aircraft in the package, and the effectiveness of the Weasel aircraft against the SAM. The following table summarizes these parameters.


Parameter               Symbol               Notional Value
----------------------  -------------------  -------------------------------------
Weasel firing rate      $\gamma_{Weasel}$    1 shot every 6 min (shoot-look-shoot)
SAM firing rate         $\gamma_{SAM}$       1 shot every 6 min (shoot-look-shoot)
Weasel effectiveness    $P^{SAM}$            80%
SAM effectiveness       $P^{Strike}$         90%
SAM effectiveness       $P^{Strike}$         70%

In this formulation, the firing rates account for acquisition, tracking, munition flyout, and reload/turn. The effectiveness accounts for initial detection and munition effectiveness. This model may be tuned and is easily extended; however, the desirable behavior has been observed with only notional parameter values. Given these parameters, the sparse transition rate matrix $Q$ is represented as follows:


[Transition rate matrix $Q$: layout not recoverable; the legible entries are zeros and products of the SAM firing rate $\gamma_{SAM}$ with effectiveness terms.]

3.3  HYBRID  DYNAMICAL  MODEL  IMPLEMENTATION 

Having defined the hybrid dynamical framework, individual asset dynamics, and the parallel composition of these dynamical entities through interaction dynamics, we implemented the plant dynamics in ALPHATECH's BMC3 Development Environment, whose graphical user interface is illustrated in Figure 6 below. This is a discrete event simulation environment written in the Java programming language. Java provides portability across multiple hardware and software platforms, ease in process distribution, and programming efficiency due to the language's inherent object-oriented structure.

The  plant’s  object  oriented  design  provides  a  flexible  environment  for  adding  and 
modifying  simulation  objects  and  events.  The  simulation  objects  contain  individual  information 
and  interact  as  separate  entities.  Likewise,  the  events  encapsulate  specific  behaviors  and 
dynamics  in  the  simulation,  providing  their  own  execution  logic.  This  makes  for  a  flexible 
system  which  may  be  easily  extended  or  modified  without  changing  the  existing  simulation 
framework.

Figure 6  ALPHATECH's BMC3 Development Environment GUI (screenshot; the visible event log lists IntersectIAD, ActivateSAM, and HideSAM events with their simulation times)

3.3.1  Discrete Event Simulation

ALPHATECH's BMC3 Development Environment is based on the discrete event architecture illustrated in Figure 7 (copied from Reference [CassLaf99]). At the heart of the simulation is the event calendar, a time-sorted list of events waiting to be executed. The simulation is driven ahead in time by continually removing the next event from the calendar, advancing the simulation time to that of the event and updating any continuous variables (such as the position of air packages), and finally processing the event. The event is created with the knowledge of which simulation objects are involved in it, and any other information that may dynamically affect its behavior. Therefore, when the simulator instructs it to do so, the event can execute itself according to its inherent dynamics, using the current state of the simulation. Depending on the specific event's execution, it may update the state of the simulation by changing the state of any relevant simulation objects, adding new objects to the simulation, removing objects, or adding new events to the calendar. This process continues until the event calendar is empty or any user-defined stop conditions are met.
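
The event calendar loop described above can be sketched with a priority queue. This is a minimal Python illustration, not the report's Java implementation; the `Simulator` class, the `arrive_at_waypoint` event, and the timings are all hypothetical.

```python
import heapq
import itertools

class Simulator:
    """Minimal discrete event loop built around a time-sorted event calendar."""

    def __init__(self):
        self.time = 0.0
        self._calendar = []                  # heap of (time, sequence, action)
        self._sequence = itertools.count()   # tie-breaker for events at equal times
        self.log = []

    def schedule(self, time, action):
        heapq.heappush(self._calendar, (time, next(self._sequence), action))

    def run(self, stop_time=float("inf")):
        # Continue until the calendar is empty or the stop condition is met.
        while self._calendar and self._calendar[0][0] <= stop_time:
            time, _, action = heapq.heappop(self._calendar)
            self.time = time                 # advance the clock to the event time
            action(self)                     # the event executes itself

def arrive_at_waypoint(name, leg_minutes, legs_left):
    """Hypothetical event: log the arrival and schedule the next leg, if any."""
    def action(sim):
        sim.log.append((sim.time, name, legs_left))
        if legs_left > 0:
            sim.schedule(sim.time + leg_minutes,
                         arrive_at_waypoint(name, leg_minutes, legs_left - 1))
    return action

sim = Simulator()
sim.schedule(10.0, arrive_at_waypoint("AirPackage_1", 5.0, 2))
sim.schedule(12.0, arrive_at_waypoint("AirPackage_2", 20.0, 1))
sim.run(stop_time=30.0)
```

In this run the calendar still holds one event (the second package's arrival at time 32) when the stop time is reached, illustrating the user-defined stop condition.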

Figure 7  Discrete Event Simulation Framework

3.3.2  Plant/Controller Interface

When an event is executed that requires guidance from the controller, the plant obtains it via an interface to the controller. This interface maintains the integrity of the plant and the controller as separate software entities. When a control decision is needed, it is the plant's responsibility to collect the current known or estimated state of all simulation objects into the data structure required as input to the controller, which is sent via the interface. Depending on arguments specified by the user, the information passed to the controller may be collected by the plant in two different ways. Most experiments performed assume perfect state information about all objects in the simulation. In this case, the plant extracts the true state of all objects from the simulation to create the data structure sent to the controller. When modeling information collection, however, a data structure is maintained with the current estimated or observed state of all simulation objects, which is updated at the appropriate times by all surveillance aircraft. In this case, a new deep copy of this data structure is made by the plant and passed via the interface to the controller. It is then the controller's responsibility to return its guidance to the plant via the interface, using the same type of data structure it was passed. However, since the controller can only affect the actions of air packages, only that type of object is returned to the plant. If the controller decides to configure and task new air packages, objects are returned to instruct the plant how to do so. When it needs to modify the missions of existing air packages, any changes will be reflected in the corresponding air package objects' missions returned as guidance. The plant must then incorporate those changes into the real air packages. The controller does not have the ability to modify or create any of the simulation objects itself, but rather instructs the plant how to do so.

As described above, the plant and controller have a well-defined client/server relationship. The plant's simulation runs as a client, which calls the controller as a server when it needs guidance. Each is a separate software entity, although they do not have to be run as separate processes. In fact, the plant and controller almost always run in the same process when performing experiments, to save the overhead of communicating over a network. The interface described above makes it easy to maintain this relationship, as it represents the network between the client and server.
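
The separation between plant and controller can be illustrated with a small Python sketch in which the plant hands the controller a deep copy of its state and applies only the air package guidance that comes back. The class, the dictionary layout, and the controller function are hypothetical stand-ins, not the report's actual interface.

```python
import copy

class Plant:
    """Owns the simulation objects; the controller only ever sees copies."""

    def __init__(self, objects, controller):
        self.objects = objects            # live state, keyed by object name
        self.controller = controller

    def request_guidance(self):
        snapshot = copy.deepcopy(self.objects)       # controller cannot touch live state
        guidance = self.controller(snapshot)
        for name, package in guidance.items():       # plant applies the returned missions
            target = self.objects.setdefault(name, {"type": "AirPackage"})
            target["mission"] = package["mission"]

def retask_controller(state):
    """Hypothetical controller: retask every air package it is shown."""
    guidance = {}
    for name, obj in state.items():
        if obj["type"] == "AirPackage":
            obj["mission"] = "strike Target_1"
            guidance[name] = obj                     # only air packages are returned
        else:
            obj["state"] = "tampered"                # affects the copy, not the plant
    return guidance

objects = {
    "AirPackage_1": {"type": "AirPackage", "mission": "loiter"},
    "Target_1": {"type": "Target", "state": "Known"},
}
plant = Plant(objects, retask_controller)
plant.request_guidance()
```

Because the controller works on a copy, its edits to non-air-package objects never reach the plant, which mirrors the rule that the controller instructs the plant rather than modifying simulation objects itself.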

3.3.3  Distributed  Processing 

Some of the controller algorithms implemented require a great deal of computation time. No matter how streamlined and efficient the code is, it is infeasible to obtain some control decisions with a single process because of the computational requirements. This becomes even more of a problem as the size of the problem being solved grows. One way of dealing with these issues of computation time and problem scalability is to distribute the controller's processing duties among multiple CPUs, thereby decreasing the overall time it takes to close the loop. This can be done in a variety of ways, depending on how a specific controller works.


Figure 8  Distribution of Controller Processing Across Multiple Platforms


One  example  of  a  controller  that  benefits  from  distributing  its  processing  is  the  Maximum 
Marginal  Return  (MMR)  algorithm  (see  Section  4.3.1).  Before  allocating  each  aircraft,  the 
MMR  must  evaluate  the  outcome  of  adding  it  to  every  air  package  in  the  current  mission  queue 
to  determine  how  the  asset  can  best  be  used.  Rather  than  calling  its  predictor  object  to  estimate 
the  performance  expectations  one  option  (asset  to  air  package  assignment)  at  a  time,  the  MMR 
can  evaluate  each  option  in  parallel  by  distributing  its  predictor  calls  over  multiple  processors. 

In this case, if twenty options needed to be evaluated and ten machines were available,
distributing the processing should be up to ten times faster than evaluating all twenty options
in a single process, neglecting communication overhead.
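
The idea of distributing predictor calls can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the controller's actual code: `predict` is a hypothetical stand-in for the MMR predictor call, its scoring rule is a placeholder, and a thread pool stands in for the pool of machines.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(option):
    """Hypothetical predictor: score one (asset, package) assignment."""
    asset, package = option
    # Placeholder score; a real predictor would estimate mission outcomes.
    return 1.0 / (1 + abs(asset - package))


def best_option(options, workers=10):
    """Evaluate all options in parallel and return the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(predict, options))  # order matches `options`
    best_index = max(range(len(options)), key=scores.__getitem__)
    return options[best_index], scores[best_index]


# One asset (id 3) considered for each of twenty packages (ids 0..19).
options = [(3, package) for package in range(20)]
choice, score = best_option(options)
```

With twenty options and ten workers, each worker handles about two options, so the ideal tenfold speedup is an upper bound that is reached only when the per-call communication cost is small compared with the predictor's run time.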

4 JAO CONTROLLER FORMULATION

This chapter presents the details of the JAO controller formulation. It begins with a general
discussion of the control framework used for this research. Then, some proof of concept
experimental results that guided our controller research framework are presented. Having
established the control and research framework, the details of the controller formulations and
implementations are presented. At the end of this chapter, a scalability analysis assesses
whether near real-time computation performance is achieved.

4.1 CONTROL FRAMEWORK

The JAO environment is an uncertain dynamical system that has the following attributes:
control decisions made over time; probabilistic transitions from one state to the next, which
depend on the choice of control; and rewards/costs accumulated during each transition, which
depend on the control and the state transition outcome. Thus, the tasking of air
packages in a JAO environment can be viewed as a sequential decision problem where each
decision is based on the observations of certain discrete events.

This  class  of  problems  can  be  formulated  as  a  Markov  decision  problem  [B96].  The 
principal  approach  for  solving  such  problems  is  Stochastic  Dynamic  Programming  (SDP).  Using 
the  SDP  formulation,  an  optimal  control  solution  is  computed  off-line,  and  on-line  computation 
is  reduced  to  feedback  rule  evaluation  or  table  lookup  interpolation.  However,  it  is  well  known 
that  this  approach  suffers  from  the  curse  of  dimensionality  and  is  intractable  for  realistic  sized 
JAO  problems. 

A  subtle  but  significant  attribute  of  the  SDP  formulation  is  that  it  produces  control 
strategies  that  anticipate  the  effects  of  future  contingencies,  and  evaluates  the  possible  actions  at 
all  future  states.  The  algorithm  accomplishes  this  by  modeling  the  future  information  arrival  and 
control  decisions.  It  is  this  fact  that  produces  proactive  versus  reactive  control  behaviors.  This 
proactive  attribute  is  imperative  for  stable  and  agile  control  of  the  JAO  enterprise  because  future 
information  arrival  and  control  opportunities  are  dependent  on  stringent  spatial,  temporal,  and 
coordination  constraints. 

Given the strengths and weaknesses of the SDP formulation, there has been a great deal of
research on Approximate Dynamic Programming (ADP) methods in recent years. These
methods generally maintain the SDP structure, but use a variety of techniques to approximate the
optimal cost-to-go. Accordingly, ADP algorithms have been applied to a variety of dynamic
decision problems [B99], [BTW97], [Patek99], [BC98], [BC99]. The goal of this research is to
extend these ADP algorithms to the JAO context. In the following paragraphs, the general SDP
and ADP formulations are presented.

Consider a discrete-time version of a dynamic decision problem,

$$x_{k+1} = f_k(x_k, u_k, w_k)$$

where $x_k$ is the state taking values in some set $X_k$, $u_k$ is the control to be selected from a
finite set $U_k(x_k)$, $w_k$ is a random disturbance, and $f_k$ is a given function. We assume that the
disturbance $w_k$, $k = 0, 1, \ldots$ has a given distribution that depends explicitly only on the
current state and control. Define a control policy, which is a sequence of feedback functions that
map each state $x_k$ to control $u_k$:

$$\pi_k = \{\mu_k(x_k), \mu_{k+1}(x_{k+1}), \ldots, \mu_{k+N-1}(x_{k+N-1})\}$$

thus, the control at time $k$ is $u_k = \mu_k(x_k) \in U_k(x_k)$. In the $N$-stage horizon problems
considered herein, the single-stage cost function is denoted by $g_k(x_k, \mu_k(x_k), w_k)$ and the
terminal cost function is denoted by $G_{k+N}(x_{k+N})$. The cost-to-go for policy $\pi$ starting from
state $x_k$ at time $k$ can be computed as follows:

$$J_k^{\pi}(x_k) = E\Big\{ G_{k+N}(x_{k+N}) + \sum_{i=k}^{k+N-1} g_i(x_i, \mu_i(x_i), w_i) \Big\}$$

and can be represented in the SDP recursion format as follows

$$J_k^{\pi}(x_k) = E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^{\pi}(f_k(x_k, \mu_k(x_k), w_k)) \}$$

for all $k$ and with the initial condition

$$J_{k+N}^{\pi}(x_{k+N}) = G_{k+N}(x_{k+N})$$

The $N$-stage SDP solution is as follows

$$\mu_k^{*}(x_k) = \arg\min_{\mu_k(x_k) \in U_k(x_k)} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^{*}(f_k(x_k, \mu_k(x_k), w_k)) \}$$
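
The backward recursion above can be illustrated on a very small problem. The sketch below is not the report's implementation; it assumes a finite state set, and the two-state "strike or wait" problem, its probabilities, and its costs are all hypothetical values chosen only to make the recursion concrete.

```python
def sdp_backward(states, controls, transitions, stage_cost, terminal_cost, horizon):
    """Finite-horizon stochastic DP by backward recursion.

    transitions(x, u) returns a list of (probability, next_state) pairs,
    which plays the role of the disturbance acting through the dynamics.
    """
    cost_to_go = {x: terminal_cost(x) for x in states}  # J at the final stage
    policy = []
    for _ in range(horizon):
        new_cost, rule = {}, {}
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # Expected stage cost plus expected cost-to-go of the successor.
                val = sum(p * (stage_cost(x, u, nxt) + cost_to_go[nxt])
                          for p, nxt in transitions(x, u))
                if val < best_val:
                    best_u, best_val = u, val
            new_cost[x], rule[x] = best_val, best_u
        cost_to_go = new_cost
        policy.insert(0, rule)  # earliest stage first
    return cost_to_go, policy


# Toy problem (hypothetical): state 0 = target alive, state 1 = target destroyed.
states = [0, 1]
controls = lambda x: ["wait", "strike"] if x == 0 else ["wait"]


def transitions(x, u):
    if x == 0 and u == "strike":
        return [(0.5, 1), (0.5, 0)]  # strike succeeds with probability 0.5
    return [(1.0, x)]


stage_cost = lambda x, u, nxt: 1.0 if u == "strike" else 0.0  # one sortie per strike
terminal_cost = lambda x: 10.0 if x == 0 else 0.0             # penalty if target survives

cost_to_go, policy = sdp_backward(states, controls, transitions,
                                  stage_cost, terminal_cost, horizon=2)
```

Note that the recursion visits every state at every stage, which is exactly what becomes intractable when the state set is as large as in a realistic JAO problem.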

The computational feasibility of the SDP recursion depends on the number of future state
realizations required to describe the system. Figure 9 provides a graphic illustration of the SDP
recursion and tree structure of possible future state realizations.

Figure 9 SDP Recursion

To illustrate the number of states required, assume that there are $N$ targets, $M$ air packages,
and that we simplify physical position descriptions to describe only the $N$ positions of the
targets. Then, the number of possible combinations of positions is $M^N$, and the number of
possible uncollected target sets at a given time is $2^N$, resulting in $(2M)^N$ states. For modest
numbers of assets and targets, the number of states far exceeds our capability for computing
and/or storing the resulting optimal decision rules.

In an attempt to overcome the SDP curse of dimensionality, the ADP algorithm replaces
the control mapping for times $k+1, \ldots, k+N-1$ with some approximate mapping

$$\bar{\mu}_{k+1}(x_{k+1}), \ldots, \bar{\mu}_{k+N-1}(x_{k+N-1})$$

Additionally, the ADP algorithm is solved forward in time, and is computed at the actual state
$x_k$ versus all possible states at time $k$. As Appendix 8.1 presents, there are a variety of
approaches to approximating future control maps. The approach adopted for this research is to
generate an approximate control policy that maps a subset of future state realizations to a subset
of control actions.

Thus, the ADP algorithm has the following policy:

$$\bar{\pi}_k = \{\mu_k(x_k), \bar{\mu}_{k+1}(x_{k+1}), \ldots, \bar{\mu}_{k+N-1}(x_{k+N-1})\}$$

Using this policy, the approximate optimal control solution at time $k$ is

$$\bar{u}_k = \arg\min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}^{\bar{\pi}}(f_k(x_k, u_k, w_k)) \}$$

Thus, the ADP policy is a one-step look-ahead policy with the optimal cost-to-go approximated by
the cost-to-go of the base policy. The ADP algorithm computes the best control at the current
state at time $k$ by balancing the current cost with an approximate cost-to-go using
approximations to model future control decisions. The ADP approach is illustrated graphically in
Figure 10, where the tree structure of the cost-to-go has been replaced with an approximate
cost-to-go.

By approximating the future control maps and by solving the problem forward in time for the
actual state $x_k$, the ADP algorithm significantly reduces the computational complexity of the
SDP framework. Again, SDP considers all of the possible states and computes a tentative decision
for each possible state, whereas ADP only computes decisions for states that actually occur in
the scenario. Thus, the number of states considered by ADP is much smaller; however, the
drawback of this approach is that the solution must be determined in real-time.

Having defined the ADP framework, the difficulty remains of computing the expectation
in the above optimization. Given the complexity of the JAO problem and the fact that the
control solution will have to be computed in real-time, only an estimate of the expectation can be
computed. Accordingly, the Q-factor is introduced:

$$Q_k(x_k, u_k) = E\{ g_k(x_k, u_k, w_k) + J_{k+1}^{\bar{\pi}}(f_k(x_k, u_k, w_k)) \}$$

For the reasons stated above, only an estimate $\tilde{Q}_k(x_k, u_k)$ of the Q-factor is obtained.
Thus, given the estimated Q-factor $\tilde{Q}_k(x_k, u_k)$ corresponding to each candidate control
$u_k \in U_k(x_k)$, the ADP control at time $k$ for state $x_k$ is

$$\bar{u}_k = \arg\min_{u_k \in U_k(x_k)} E\{ \tilde{Q}_k(x_k, u_k) \}$$

Figure 10 NDP Solution Structure

In summary, the ADP algorithm provides the control framework for this research. This
technique has been shown to exhibit operationally consistent, proactive behaviors for relevant
military applications. The focus of this research is to extend the ADP method to the problem of
JAO, and in doing so, develop ADP algorithms that exhibit optimal behavior in real-time or near
real-time for realistic JAO scenarios. As a final note, given the approximation illustrated above,
the ADP algorithm is not as ambitious as the SDP, and only provides modest guarantees of
near-optimality [BTW97]; in fact, it is an intermediate methodology between Model Predictive
Control (MPC) and SDP.

4.2  PROOF  OF  CONCEPT  EXPERIMENTS  AND  RESEARCH  FRAMEWORK 

In  this  section,  the  proof  of  concept  experiments  that  guided  the  development  of  ADP 
control  techniques  for  the  JAO  problem  will  be  presented.  Following  the  proof  of  concept 
experiment  discussion,  the  research  framework  adopted  for  this  program  will  be  presented. 

4.2.1  Proof  of  Concept  Experimental  Results 

At the beginning of this research program, proof of concept experiments were performed
to identify key technology gaps in the state of the art of ADP with respect to the JAO domain.
By identifying the key technology barriers, the research and development could be focused on
mitigating these barriers so as to satisfy the research objective of providing military
commanders with real-time, near-optimal control strategies for air-to-ground operations. Since a
majority of these initial experimental results have been documented in conference publications,
most of the details are contained in the Appendices; only the results and lessons learned are
summarized here.

The  approach  used  to  establish  the  proof  of  concept  experiments  was  to  apply  ADP 
techniques  that  have  been  applied  to  other  relevant  military  problems.  As  part  of  AFOSR’s  New 
World  Vistas  (NWV)  program,  ALPHATECH  performed  basic  research  on  ADP  control 
techniques  for  the  problem  of  sensor  asset  scheduling.  Under  this  program,  ADP  control 
techniques  were  developed  that  optimize  the  collection  of  data  by  multiple  sensor  platforms 
based  on  requests  of  multiple  end-users.  The  controller  dynamically  replans  the  paths  of  sensor 
platforms  as  the  result  of  dynamic  requests  for  data,  failure  of  individual  sensors  (thereby 
providing  fault  tolerance),  and  failure  of  sensors  to  collect  individual  pieces  of  data  due  to 
unpredictable  obscuration  effects  such  as  weather.  Many  of  the  results  from  this  program  are 
contained  in  two  DARPA  Advances  in  Enterprise  Control  papers  that  appear  in  Appendices  8.2 
and  8.3. 

Given this successful application of ADP techniques to militarily relevant problems, the
next logical step was to apply the ADP techniques developed under the AFOSR program to the
problem of orchestrating a 24-hour air-to-ground campaign in a risky environment. However,
the multi-vehicle scheduling problem does not map one-to-one with the air-to-ground problem
because the latter is much larger and contains a richer set of dynamics. For one, the
air-to-ground problem requires the formation and tasking of air packages in an environment where
risk and reward depend on the air package composition. Furthermore, since air packages pose a
risk to enemy assets and enemy assets pose a risk to air packages, considerable coupling exists
between battlespace assets. Given the fact that multiple turns of the aircraft will be required to
achieve the operational objectives, this coupling remains dominant throughout the air-to-ground
campaign.

For the proof of concept experiments, the rollout algorithm [B99], [BTW97] was chosen
for implementation on a reduced-order JAO problem. The rollout algorithm, which has been
used for a wide variety of dynamic decision problems [B99], [Patek99], [BC98], [BC99],
is a technique that exploits knowledge of a suboptimal decision rule to obtain an approximate
cost-to-go for use in the ADP framework. The rollout algorithm approximates the control mapping
for times $k+1, \ldots, k+N-1$ with a baseline heuristic. Thus, the rollout algorithm computes the
best control at the current state $x_k$ at time $k$ by balancing the current cost with an approximate
cost-to-go using a baseline heuristic to model future control decisions. To generate the estimate
of the Q-factor, Monte Carlo evaluations were performed by simulating the base policy in
real-time, i.e. simulation-in-the-loop.
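
The rollout idea described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the simulator `step`, the "always hold" base policy, the costs, and the 0.9 strike success probability are all hypothetical.

```python
import random


def rollout_control(state, controls, step, base_policy, horizon, n_samples, rng):
    """One-step look-ahead with the cost-to-go estimated by simulating a base policy.

    step(state, control, rng) -> (next_state, stage_cost) is a hypothetical
    simulator; base_policy(state) is the suboptimal heuristic being improved.
    """
    def simulate(first_control):
        x, total, u = state, 0.0, first_control
        for _ in range(horizon):
            x, cost = step(x, u, rng)
            total += cost
            u = base_policy(x)  # the heuristic models all later decisions
        return total

    q_estimates = {}
    for u in controls(state):
        samples = [simulate(u) for _ in range(n_samples)]
        q_estimates[u] = sum(samples) / n_samples  # Monte Carlo Q-factor estimate
    return min(q_estimates, key=q_estimates.get), q_estimates


def step(targets, control, rng):
    """Toy dynamics: a strike costs 1 and removes a target with probability 0.9."""
    cost = 0.0
    if control == "strike":
        cost += 1.0
        if targets > 0 and rng.random() < 0.9:
            targets -= 1
    cost += 2.0 * targets  # penalty for every target still alive after the step
    return targets, cost


choice, q = rollout_control(
    state=1,
    controls=lambda x: ["hold", "strike"],
    step=step,
    base_policy=lambda x: "hold",
    horizon=3,
    n_samples=200,
    rng=random.Random(0),
)
```

In this toy case the base policy never strikes, yet the one-step look-ahead selects a strike because the simulated cost-to-go after striking is lower. The cost of the approach is also visible: every candidate control requires `n_samples` full simulations, which is the source of the scalability problem discussed later in this section.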

The rollout algorithm is applied to a small JAO scenario, Figure 11, that includes limited
assets, risk/reward that is dependent on package composition, basic threat avoidance routing, and
multiple targets, some of which are fleeting and emerging. Simulation results illustrate the
benefits of the approximate optimal control strategy. It is shown that the rollout strategy
provides statistically significant performance improvements over strategies that do not
anticipate future information arrival and control decisions. The performance improvements were
attributed to the fact that the rollout algorithm is able to learn near-optimal behaviors:
establishing combat air patrol over time critical areas, staging packages and opening attack
corridors to manage friendly asset attrition, aggressively prosecuting fleeting targets, and
reserving assets for contingencies that are not modeled in the baseline heuristic. The details of
this experiment and the results obtained are contained in Appendix 8.4.

Figure 12 Proof of Concept Performance: ADP for Reduced-Order JAO Scenario


Having shown that ADP strategies can produce operationally consistent, proactive control
solutions, the question remains whether this current ADP implementation using rollout is feasible
for a realistic sized JAO scenario. Figure 13 illustrates the scalability assessment performed on
this ADP implementation. From this figure, it is seen that real-time computation performance is
not feasible for larger scenarios. Furthermore, when considering a richer set of dynamics, it is
expected that the computation complexity will grow by several orders of magnitude.

Figure 13 Scalability Assessment of Proof of Concept ADP Implementation (legend: AT Computation Capability; High-Performance Computing)

From this proof of concept experiment, the following lessons were learned relative to
JAO control:

•  Proactive and Reactive Control both provide near-optimal performance in situations with
abundant opportunity and time to react to uncertain future information arrival;

•  Proactive  Control  provides  near-optimal  performance  in  situations  with  limited 
opportunity  and/or  time  to  react  to  uncertain  future  information  arrival: 

-  High  attrition  environment 

-  Control  response  delays,  i.e.  inertia,  >  significant  event  time-scales 

-  Information  delay 

•  Key  Proactive  JAO  Behaviors:  Positioning  assets  now  for  future  opportunity 

-  Preparing  battlespace 

-  Reserving  assets 

-  Geographically  positioning  assets 

•  Computational  Performance: 

-  ADP  using  combinatorial  rollout  to  search  control  space  and  temporal  rollout,  i.e. 
simulation-in-the-loop,  for  prediction  is  infeasible  for  100  DMPI  scenario 

-  The  reactive  controller  implemented  provides  rapid  replanning,  and  is  feasible  for  100 
DMPI  scenario  and  has  considerable  margin 

In summary, the ADP technique known as rollout, which was developed under the AFOSR
NWV program, was applied to a reduced-order JAO scenario that includes limited assets,
risk/reward that is dependent on mission composition, basic threat avoidance routing, and
multiple targets, some of which are fleeting and emerging. Simulation results illustrate the
benefits of the ADP control strategy. It is shown that the proactive ADP strategy provides
statistically significant performance improvements over a reactive feedback strategy by
developing operationally consistent control strategies that anticipate likely contingencies and
position assets for opportunities of recourse. However promising these results are, the current
implementation of the rollout algorithm is not computationally feasible for realistic JAO
scenarios.

4.2.2  Control  Research  Framework 

The proof of concept experiments in the above section highlighted ADP performance
both in terms of behaviors and computational complexity. It was shown that ADP control
strategies could produce proactive, operationally consistent behaviors, but scalability remains the
key technical barrier to using this technique for realistically sized JAO problems. As a result, the
bulk of this research was devoted to reducing the computational complexity of the ADP
approach while maintaining the operationally consistent behaviors. Based on the lessons
learned, a two-pronged approach for reducing the problem complexity was pursued:

1. Reduce control problem size where appropriate, and

2. Improve the efficiency of the ADP algorithms.

As discussed in the previous section, one of the lessons learned was that both reactive and
proactive control provide near-optimal control strategies in situations where there is abundant
opportunity and time to react to uncertain future information arrival. However, the reactive
control approach can produce this solution in a fraction of the time that it takes the proactive
control approach. Thus, in situations where there is abundant opportunity and time to react to
uncertain future information arrival, it makes sense to use a reactive rather than a proactive
control approach. The lessons learned also identified situations in which a proactive control
approach is required in order to provide near-optimal control strategies. These situations include
environments that exhibit high attrition and significant control response and information delays
relative to the battlespace time scale. With this intuition into the JAO control problem, a hybrid,
multi-rate control architecture was adopted that tailors the application of control, i.e. reactive or
proactive, at a time scale that is appropriate to the battlespace situation at hand.

Figure 14 illustrates the hybrid, multi-rate architecture chosen for this research. To
implement this architecture, three types of information must be specified a priori. First, the
significant events at which control loop closures are to occur must be defined. Second, for each
significant event, a set of control algorithms must be defined. Finally, the loop closure rate for
each significant event must be defined. In this figure, event-based loop closures are denoted by
E, whereas temporal-based loop closures are denoted by T. By adopting this architecture, the
control problem complexity is substantially reduced for situations that do not require advanced
ADP algorithms.

Figure 14 Hybrid, Multi-Rate Control Architecture
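
One way to picture the three pieces of a priori information (significant events, the control algorithm for each event, and the loop closure rate for each event) is as a small dispatch table. The sketch below is purely illustrative and is not the architecture's actual implementation: the class names, the event names, and the use of a minimum interval to represent the loop closure rate are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoopClosure:
    """A priori specification for one significant event (all names hypothetical)."""
    controller: Callable[[dict], str]  # control algorithm to run for this event
    min_interval: float                # minimum time between loop closures


class HybridController:
    def __init__(self, closures: Dict[str, LoopClosure]):
        self.closures = closures
        self.last_run: Dict[str, float] = {}

    def on_event(self, event: str, time: float, state: dict) -> List[str]:
        """Close the loop for `event` if it is registered and its rate allows."""
        spec = self.closures.get(event)
        if spec is None:
            return []  # not a significant event: no loop closure
        last = self.last_run.get(event)
        if last is not None and time - last < spec.min_interval:
            return []  # closure rate limit not yet reached
        self.last_run[event] = time
        return [spec.controller(state)]


# Event-based closure handled reactively, temporal closure handled proactively.
hybrid = HybridController({
    "pop_up_target": LoopClosure(lambda s: "reactive", 0.0),
    "periodic": LoopClosure(lambda s: "proactive", 60.0),
})
log = [hybrid.on_event(e, t, {}) for e, t in
       [("periodic", 0.0), ("periodic", 30.0), ("pop_up_target", 31.0), ("periodic", 60.0)]]
```

The design point this illustrates is that the expensive proactive algorithm runs only at its own, slower rate, while inexpensive reactive replanning can respond to events as they occur.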

The implementation of this control architecture provided an immediate reduction in the
control problem complexity for certain situations. However, as noted in the lessons learned,
situations with limited opportunity and/or time to react to uncertain future information arrival
require proactive control strategies to achieve optimal performance. Thus, for these situations,
the complexity reduction must be achieved by developing fast and efficient ADP algorithms that
still exhibit the desired operationally consistent behaviors. The approach adopted for developing
efficient ADP algorithms was to exploit the natural decomposition, which is illustrated in
Figure 15, between the control space search and the performance prediction problems of the ADP
framework. Given this natural decomposition, parallel research initiatives that focused on
reducing the complexities of these problems were conducted. Figure 16 highlights some of the
technologies that were investigated as part of the complexity mitigation.

Figure 15 ADP Complexity Mitigation Approach (diagram label: Predict Multi-Stage Future Value)

Figure 16 Enabling Technologies Investigated to Mitigate ADP Complexities for JAO Problem

In summary, the proof of concept experiments showed that ADP control strategies could
produce proactive, operationally consistent behaviors, but scalability remains the key technical
barrier to using this technique for realistically sized JAO problems. As a result, the bulk of this
research was devoted to reducing the computational complexity of the ADP approach
while maintaining the operationally consistent behaviors. One immediate complexity reduction
was achieved by adopting a hybrid, multi-rate control architecture that tailors the application of
control, i.e. reactive or proactive, to the battlespace situation at hand; however, additional
complexity reduction is required. Thus, the research was focused on developing fast and
efficient combinatorial assignment and prediction models that form the foundation of the ADP
algorithm. Figure 17 illustrates the control architecture in the context of the ADP
decomposition, i.e. assignment and prediction. It is the goal of this research to develop a
spectrum of ADP control strategies that can be mixed and matched to provide a broad range of
performance and computational complexity. The details of the different combinatorial
assignment and analytic prediction models developed under this research follow in the
subsequent sections.


Figure 17 ADP Research Framework in the Context of the Hybrid, Multi-Rate Control Architecture (diagram labels: ADP Algorithm; Control Option Selection; Predict Future Value)


4.3  COMBINATORIAL  OPTIMIZATION  ALGORITHMS 

In this section, the combinatorial optimization algorithms that were incorporated into our
hybrid, multi-rate control architecture are presented. These algorithms were selected based on a
trade study analysis of different combinatorial optimization approaches for a 1-stage problem.
Details of this trade study appear in Appendix 8.5. From this trade study, three approaches were
chosen: combinatorial rollout, maximum marginal return, and the surrogate method. The
formulation of each of these techniques is presented in the subsequent sections. We begin with
a general discussion of combinatorial optimization.

As  noted  in  Section  4.1,  optimization  of  the  Q-factor  is  fundamental  to  our  control 
formulation.  In  a  standard  SDP  implementation,  the  control  space  is  enumerated,  thus 
guaranteeing  an  optimal  solution.  However,  in  the  JAO  problem,  the  complexity  of  the  control 
space  and  the  requirements  for  near  real-time  computation  make  explicit  enumeration  infeasible 
for  problems  of  interest.  Therefore,  in  the  course  of  our  proof-of-concept  experiments,  we 
investigated several alternatives to direct enumeration. These alternatives are approximate
combinatorial optimization techniques adapted to the structure of the JAO problem.
Our rationale for using approximate combinatorial optimization is that
the  JAO  problem,  even  in  the  absence  of  dynamics,  is  provably  NP-Hard,  as  3-D  matching  can 
be  reduced  in  polynomial  time  to  instances  of  the  JAO  problem.  We  consider  three  specific 
approximate  combinatorial  techniques: 

•  A  greedy  algorithm,  based  on  maximum  marginal  return,  using  resource  by  resource 
decomposition; 

•  Combinatorial  rollout,  which  is  an  approximate  technique  for  incrementally  building  a 
solution; 

•  Surrogate  method,  which  embeds  the  combinatorial  optimization  in  a  larger  continuous  space 
optimization. 

We  describe  each  of  these  techniques  in  greater  detail  below. 

To better define the combinatorial problem, consider the problem of optimizing a known
function $Q(x_k, u_k)$ over a set of feasible controls $u_k \in U(x_k)$ for a given state $x_k$. In the
JAO case, each control corresponds to a set of resource allocation pairs
$(N_i^{\text{Strike}}, N_i^{\text{Weasel}})$ to possible missions $i = 1, \ldots, I$; that is,

$$u_k = \{(N_1^{\text{Strike}}, N_1^{\text{Weasel}}), \ldots, (N_I^{\text{Strike}}, N_I^{\text{Weasel}})\}$$

in which $N_i^{\text{Strike}}$ strike aircraft and $N_i^{\text{Weasel}}$ weasel aircraft are assigned to the
$i$-th target for all $i$.

Due to the potential coupling in effectiveness across missions which fly across similar air
defenses, function $Q$ cannot be decomposed as an additive sum of effectiveness across missions.
The combinatorial optimization problem becomes

$$\max_{u_k \in U(x_k)} Q(x_k, u_k)$$

subject to resource constraints on the total availability of strike and weasel aircraft, which may
be written as

$$\sum_i N_i^{\text{Strike}} \le N^{\text{Strike}}(x_k), \qquad \sum_i N_i^{\text{Weasel}} \le N^{\text{Weasel}}(x_k)$$

where $N^{\text{Strike}}(x_k)$ and $N^{\text{Weasel}}(x_k)$ correspond to the availability of each aircraft
type at a given state $x_k$.

4.3.1  Maximum  Marginal  Return 

The  first  algorithm  we  discuss  is  the  Maximum  Marginal  Return  algorithm  (MMR).  The 
basis  for  this  algorithm  is  the  following  optimization  problem,  which  is  a  simplified 
mathematical  model  of  the  JAO  problem  that  provides  explicit  approximations  for  the  function 
Q(x,u)  for  a  single  wave  scenario. 

Assume  that  we  have  present  k  =  SAM  sites  in  the  scenario,  /  =  l,...,r  targets, 

w  =  l,...,W  weasels,  and  s  =  l,...,S  strike  aircraft  in  the  scenario.  The  simplified  model  assumes 
that  trajectories  for  each  target  t  are  known,  and  have  a  sequence  of  SAMs  associated  with  each 
trajectory.  When  a  weasel  interacts  with  SAM  k  while  headed  for  target  t,  the  probability  that  the 
SAM  is  destroyed  is  given  by  .  We  make  the  probabilistic  assumption  that  interactions 
between  SAMs  and  weasels  are  independent  events.  In  this  case,  given  m  weasels  headed  for 
target  t,  the  probability  that  SAM  k  survives  is  given  by: 

Similarly,  given  a  full  set  of  missions,  with  m,  weasels  headed  for  target  t,  the  overall  probability 
that  SAM  k  survives  is  given  by 

t 

This  simple  equation  requires  the  additional  assumption  that  risk  to  weasels  is  negligible  in  these 
interactions. 
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The  second  part  of  the  model  represents  the  interactions  between  SAMs  and  strike 
aircraft,  and  between  strike  aircraft  and  targets.  Let  denote  the  probability  that  a  strike 
aircraft  which  reaches  target  t  destroys  the  target.  Let  denote  the  probability  that  if  SAM  k  is 
still  alive,  it  will  destroy  a  strike  aircraft  headed  for  target  t.  Note  the  dependence  on  the  target, 
which  represents  how  strongly  a  specific  SAM  can  interact  with  the  route  headed  for  target  t. 
Assuming  again  that  interaction  events  between  SAMs  and  strike  aircraft,  between  strike  aircraft 
and  targets,  and  between  weasel  aircraft  and  SAMs  are  mutually  independent,  we  obtain  the 
following  expressions:  Let  n,  denote  the  number  of  strike  aircraft  allocated  to  target  t.  Then,  the 
probability  that  a  strike  aircraft  headed  for  target  t  reaches  target  t  is  given  by  P(/): 

k  t' 

and  the  probability  that  target  t  survives  is  given  by 

Given  target  values  F,  the  combinatorial  optimization  problem  of  interest  is: 

max  {t)) 

t 

subject  to  <  S,  ^ 

t  t 

The  above  objective  function  exhibits  the  coupling  between  packages  headed  for  different  targets 
affected  by  common  SAMs,  plus  the  coupling  between  weasels  and  strike  aircraft  in  packages. 
Consideration  of  this  objective  function  reveals  that  if  the  weasel  allocations  to  each  package  are 
fixed,  then  the  objective  function  becomes  a  separable  objective  function  over  the  strike  aircraft 
content,  where  each  separable  component  is  concave.  That  is,  if  we  fix  m,,  the  objective  function 
becomes 

t 

where 

has  a  concave  envelope.  This  means  that  finding  the  optimal  strike  aircraft  allocation  is  a  simple 
optimization  problem  if  the  weasel  allocation  is  known.  This  problem  is  solvable  optimally  using 
a maximum marginal return algorithm, which assigns strike aircraft incrementally to the target
that  offers  the  greatest  increase  in  performance  per  strike  aircraft  assigned.  This  approach 
suggests  an  alternating  procedure,  which  alternates  between  selecting  weasel  aircraft 
assignments  and  strike  aircraft  assignments.  Unfortunately,  fixing  the  strike  aircraft  assignment 
does  not  decouple  the  weasel  assignment  problem,  which  still  remains  a  combinatorially  hard 
problem. 
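The maximum marginal return idea for the simplified case (weasel allocation fixed, separable concave objective) can be sketched in Python as follows. This is only an illustration: the function name `max_marginal_return`, the per-target return `value * (1 - (1 - p_kill) ** n)`, and the numbers in the usage lines are assumptions, not quantities taken from this report.

```python
import heapq


def max_marginal_return(gains, n_aircraft):
    """Greedy allocation; gains[t](n) is the value of n strike aircraft on target t.

    Assumes (hypothetical interface) that each gains[t] is concave and
    nondecreasing in n, which is what makes the greedy choice optimal.
    """
    alloc = [0] * len(gains)
    # Max-heap (negated gains) holding the marginal return of the next aircraft per target.
    heap = [(-(g(1) - g(0)), t) for t, g in enumerate(gains)]
    heapq.heapify(heap)
    for _ in range(n_aircraft):
        if not heap:
            break
        neg_gain, t = heapq.heappop(heap)
        if -neg_gain <= 0:
            break  # no target benefits from another aircraft
        alloc[t] += 1
        g = gains[t]
        heapq.heappush(heap, (-(g(alloc[t] + 1) - g(alloc[t])), t))
    return alloc


def make_gain(value, p_kill):
    # Illustrative concave return: value times P(at least one of n aircraft kills the target).
    return lambda n: value * (1.0 - (1.0 - p_kill) ** n)


gains = [make_gain(10.0, 0.5), make_gain(6.0, 0.8)]
allocation = max_marginal_return(gains, 3)
print(allocation)  # [2, 1]
```

With these illustrative numbers the marginal returns are 5.0, 2.5, ... for the first target and 4.8, 0.96, ... for the second, so the three aircraft go to targets 0, 1, 0 in that order.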

The maximum marginal return (MMR) algorithm that we developed uses the alternating decomposition,
while applying an incremental marginal return algorithm for both the weasel and strike aircraft
allocation to packages. It also uses the more detailed objective function Q which arises from our
ADP approaches. The algorithm is outlined below:

• Initially, allocate one strike aircraft per mission. That is, let $N_i^{\text{strike}} = 1$ for all i.

• Set $N_i^{\text{weasel}} = 0$ for all i. Determine weasel aircraft allocations per package as follows:

    • For each package i, compute the marginal return MR(i) as follows: Let
    $Q_i^{w,k}(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\})$ denote the performance achieved by adding k weasel aircraft to
    package i, for k = 1, 2. Then,

    $MR(i) = \max_k \{ Q_i^{w,k}(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}) / k \}$

    • Select the package i with largest MR(i), and increase $N_i^{\text{weasel}}$ by 1.

    • Repeat until no unallocated weasel aircraft remain.

• Set $N_i^{\text{strike}} = 0$ for all i. Determine strike aircraft allocations per package as follows:

    • For each package i, compute the marginal return MR(i) as follows: Let
    $Q_i^{s,k}(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\})$ denote the performance achieved by adding k strike aircraft to
    package i, for k = 1, 2. Then,

    $MR(i) = \max_k \{ Q_i^{s,k}(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}) / k \}$

    • Select the package i with largest MR(i), and increase $N_i^{\text{strike}}$ by 1.

    • Repeat until no unallocated strike aircraft remain.

• Repeat the iteration between assignment of weasel and strike aircraft until convergence or until a
fixed number of iterations has been performed.
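The alternating procedure above can be sketched as follows. The sketch reads MR(i) as the gain in Q per aircraft added, looks one and two aircraft ahead, and stops a pass early when no option improves Q; these are interpretive assumptions. The function names, the dictionary layout of a package, and `toy_q` are hypothetical stand-ins for the report's objective function Q.

```python
def allocate_type(q, packages, kind, available, lookahead=(1, 2)):
    """One greedy pass for one aircraft type ('strike' or 'weasel')."""
    while available > 0:
        base = q(packages)
        best_rate, best_i = 0.0, None
        for i, pkg in enumerate(packages):
            for k in lookahead:
                if k > available:
                    continue
                pkg[kind] += k
                rate = (q(packages) - base) / k  # gain per aircraft (assumed reading of MR)
                pkg[kind] -= k
                if rate > best_rate:
                    best_rate, best_i = rate, i
        if best_i is None:
            break  # assumption: stop when no option improves q
        packages[best_i][kind] += 1  # commit a single aircraft, whatever k was tested
        available -= 1
    return packages


def mmr(q, n_packages, n_strike, n_weasel, rounds=2):
    """Alternate weasel and strike passes for a fixed number of rounds."""
    packages = [{"strike": 1, "weasel": 0} for _ in range(n_packages)]
    for _ in range(rounds):
        for pkg in packages:
            pkg["weasel"] = 0
        allocate_type(q, packages, "weasel", n_weasel)
        for pkg in packages:
            pkg["strike"] = 0
        allocate_type(q, packages, "strike", n_strike)
    return packages


def toy_q(packages, values=(10.0, 4.0)):
    # Illustrative objective: weasels raise the chance that strike aircraft reach the target.
    total = 0.0
    for value, pkg in zip(values, packages):
        reach = 1.0 - 0.4 * 0.5 ** pkg["weasel"]
        total += value * (1.0 - (1.0 - 0.6 * reach) ** pkg["strike"])
    return total


initial_value = toy_q([{"strike": 1, "weasel": 0}, {"strike": 1, "weasel": 0}])
result = mmr(toy_q, n_packages=2, n_strike=4, n_weasel=2)
print(result, toy_q(result))
```

Committing only one aircraft per step, even when the two-aircraft option had the best rate, mirrors the text: the second increment is used to see past non-concave regions, not to allocate in larger steps.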




The  above  algorithm  incorporates  several  important  extensions  to  address  the  use  of  the 
objective  function  Q  instead  of  the  simpler  model.  In  particular,  to  deal  with  possible  regions 
where  the  function  is  not  concave,  we  use  two  increments  (k  =  1,2)  to  compute  the  maximum 
marginal  return.  Note  that  the  above  algorithm  is  an  approximate  algorithm,  as  the  true  objective 
function  for  a  multistage  problem  will  not  be  separable  or  concave.  It  does  provide  a  fast, 
approximate  algorithm,  which  can  be  used  in  combination  with  other  algorithms  such  as 
combinatorial  rollout,  which  we  consider  next. 

4.3.2  Combinatorial  Rollout 

The  Combinatorial  Rollout  algorithm  is  a  recent  algorithm  developed  by  Bertsekas  et  al 
[BTW97].  The  basic  idea  of  the  algorithm  is  to  improve  on  the  performance  of  a  baseline 
algorithm,  which  in  our  case  is  the  MMR  algorithm  described  above.  These  incremental 
improvements  are  related  to  the  policy  improvement  step  in  standard  policy  iteration  algorithms 
for  dynamic  programming. 

In  combinatorial  rollout,  we  solve  the  optimization  problem,  one  package  at  a  time,  as 
follows: 

• Order the package indices i = 1, ..., T. Let the current index i' = 1.

• Assume that packages $(N_i^{\text{strike}}, N_i^{\text{weasel}})$ are fixed for i < i'. Determine the package allocation
$(N_{i'}^{\text{strike}}, N_{i'}^{\text{weasel}})$ to target i' as follows:

    • Enumerate all possible package allocations to target i'. For each package allocation
    $(N_{i'}^{\text{strike}}, N_{i'}^{\text{weasel}})$, allocate remaining strike aircraft and weasel aircraft (not already
    allocated to packages i < i' or allocated to i') using the MMR algorithm to packages i > i',
    and evaluate the performance of the composite assignments.

    • Select as $(N_{i'}^{\text{strike}}, N_{i'}^{\text{weasel}})$ the allocation which gives the maximum performance in the
    previous substep.

• If i' < T, increment i' = i' + 1 and repeat the previous step; else, the algorithm is complete.
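The rollout steps above can be sketched as follows. The baseline here is a simple one-aircraft-at-a-time greedy heuristic standing in for the MMR algorithm, and `toy_q` is an illustrative objective; both are assumptions, as are all function names.

```python
from itertools import product


def greedy_baseline(q, fixed, n_targets, s_left, w_left):
    """Complete a partial plan greedily (stand-in for MMR); returns a list of (strike, weasel)."""
    plan = [list(p) for p in fixed] + [[0, 0] for _ in range(n_targets - len(fixed))]
    left = [s_left, w_left]
    while True:
        base, best = q(plan), None
        for i in range(len(fixed), n_targets):
            for kind in (0, 1):  # 0 = strike, 1 = weasel
                if left[kind] == 0:
                    continue
                plan[i][kind] += 1
                gain = q(plan) - base
                plan[i][kind] -= 1
                if gain > 1e-12 and (best is None or gain > best[0]):
                    best = (gain, i, kind)
        if best is None:
            return [tuple(p) for p in plan]
        plan[best[1]][best[2]] += 1
        left[best[2]] -= 1


def rollout(q, n_targets, n_strike, n_weasel):
    """Fix one package at a time, scoring each candidate by its baseline completion."""
    fixed = []
    for _ in range(n_targets):
        s_left = n_strike - sum(s for s, _ in fixed)
        w_left = n_weasel - sum(w for _, w in fixed)
        best_val, best_pkg = None, None
        for pkg in product(range(s_left + 1), range(w_left + 1)):
            plan = greedy_baseline(q, fixed + [pkg], n_targets,
                                   s_left - pkg[0], w_left - pkg[1])
            val = q(plan)
            if best_val is None or val > best_val:
                best_val, best_pkg = val, pkg
        fixed.append(best_pkg)
    return fixed


def toy_q(plan, values=(10.0, 4.0)):
    # Illustrative objective: weasels improve the reach probability of strike aircraft.
    return sum(v * (1.0 - (1.0 - 0.6 * (1.0 - 0.4 * 0.5 ** w)) ** s)
               for v, (s, w) in zip(values, plan))


baseline_plan = greedy_baseline(toy_q, [], 2, 3, 1)
rollout_plan = rollout(toy_q, 2, 3, 1)
print(baseline_plan, rollout_plan)
```

Because every candidate package for the current target is completed by the baseline before being scored, the number of objective evaluations grows with the number of candidate packages times the cost of one baseline run.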

In  a  setting  where  the  performance  function  is  evaluated  exactly,  combinatorial  rollout  is 
guaranteed  to  perform  no  worse  than  the  baseline  algorithm,  provided  the  baseline  algorithm 
satisfies  a  mild  condition  of  sequential  consistency  [BTW97]  which  our  MMR  algorithm 
satisfies.  However,  when  the  performance  function  is  evaluated  only  approximately,  as  in 
stochastic  settings,  this  performance  improvement  is  not  guaranteed.  In  particular,  the 
incremental nature of the algorithm makes it difficult to distinguish among package allocations
where  the  difference  in  performance  is  of  the  same  order  of  magnitude  as  the  evaluation  error.  In 
the  next  section,  we  describe  a  combinatorial  algorithm,  which  is  explicitly  designed  for 
optimization  of  functions  with  uncertainty  in  performance  evaluation. 

4.3.3  Surrogate  Method 

The Surrogate Method [GoCassOO] is a gradient-based approach for searching the control
space (i.e., sets of feasible missions). This approach constructs a continuous "surrogate" objective
function that is used to generate gradient information that guides a search through the discrete
space of mission allocations. It also uses a stochastic approximation technique, which allows for
uncertain gradient information evaluation. The gradient information is obtained by selecting a set
of neighbor points to the current allocation, and evaluating the function at these neighbor points.

The principal idea of the surrogate method is to embed the combinatorial optimization
problem into a continuous optimization problem. Let $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$ denote the combinatorial
decision variables. The surrogate method instead optimizes over variables $\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\}$,
where x denotes a continuous allocation. Let Q denote the combinatorial performance index; the
key problem is that Q is only defined on feasible nonnegative integer package assignments. The
algorithm extends the function Q to a surrogate function $R(\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\})$ defined on continuous
package assignments as follows:

• Given $\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\}$, find the closest integer assignment $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$, and evaluate
the performance $Q(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\})$.

• Find 2T neighbors of $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$ by modifying the number of strike or weasel aircraft
to each package one at a time by one aircraft, and evaluate the performance Q for each
neighbor.

• Use the 2T + 1 values to evaluate $R(\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\})$ as a linear interpolation of the corner
values.

Note that the above approximation has $R(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}) = Q(\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\})$ at nonnegative
integer values of the allocations $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$. Since R is now defined over continuous
variables, one can compute the gradient, which is piecewise constant over regions where the
closest corner and the neighbors are constant. Note, however, that the continuous function is not
differentiable at integer values, because these values are at the intersection of different piecewise
linear approximations (i.e., an integer point is a corner for many regions).

The  surrogate  algorithm  is  summarized  as  follows: 

• Initialize a feasible guess $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$ at the package allocations across all targets.

• Initialize the step size to 1, and the fractional package allocations $\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\}$ by
perturbing the integer assignments $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$. Initialize the iteration index to k = 1.

• Perform a gradient iteration as follows:

    • Compute the 2T + 1 neighbors of the fractional package allocations $\{(x_i^{\text{strike}}, x_i^{\text{weasel}})\}$,
    evaluate the Q function at the neighbors, and evaluate the gradient of R.

    • Modify the fractional allocation by taking a step along this gradient, scaled by the
    current step size.

    • If the new allocation is infeasible, project it inside the feasible region by
    reducing each allocation by the same proportionality constant.

    • Increase the iteration index k = k + 1, and reduce the step size to 1 / k.

    • Compute the nearest feasible integer allocation $\{(N_i^{\text{strike}}, N_i^{\text{weasel}})\}$ and evaluate its performance.

• Repeat iterations for a specified number of iterations.
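The iteration summarized above can be sketched as follows. The sketch treats Q as a performance index to be maximized and therefore steps up the estimated gradient; that sign choice, the perturbation size, the rounding repair in `nearest_feasible`, and `toy_q` are all assumptions made for illustration, and every function name is hypothetical.

```python
import random


def surrogate_gradient(x, q):
    """Finite-difference gradient of the piecewise-linear surrogate at fractional x."""
    corner = [round(v) for v in x]
    base = q(corner)
    grad = []
    for j, v in enumerate(x):
        step = 1 if v >= corner[j] else -1  # neighbour on the side of the corner where x lies
        neighbour = list(corner)
        neighbour[j] += step
        grad.append((q(neighbour) - base) * step)
    return grad


def project(x, budget):
    """Clip at zero, then shrink every entry by one common factor if over budget."""
    x = [max(0.0, v) for v in x]
    total = sum(x)
    return [v * budget / total for v in x] if total > budget else x


def nearest_feasible(x, budget):
    """Round, then undo the largest round-ups until the budget is met."""
    n = [round(v) for v in x]
    while sum(n) > budget:
        j = max(range(len(n)), key=lambda i: n[i] - x[i])
        n[j] -= 1
    return n


def surrogate_search(q, strike, weasel, n_strike, n_weasel, iterations=40, seed=0):
    rng = random.Random(seed)
    T = len(strike)
    start = list(strike) + list(weasel)

    def q_flat(n):
        return q(n[:T], n[T:])

    def feasible(x):
        return project(x[:T], n_strike) + project(x[T:], n_weasel)

    x = feasible([v + rng.uniform(-0.3, 0.3) for v in start])
    best, best_val = start, q_flat(start)
    for k in range(1, iterations + 1):
        grad = surrogate_gradient(x, q_flat)
        x = feasible([v + g / k for v, g in zip(x, grad)])  # step size 1 / k
        cand = nearest_feasible(x[:T], n_strike) + nearest_feasible(x[T:], n_weasel)
        val = q_flat(cand)
        if val > best_val:
            best, best_val = cand, val
    return best[:T], best[T:], best_val


def toy_q(strike, weasel, values=(10.0, 4.0)):
    # Illustrative noise-free objective; a stochastic evaluator could be substituted.
    return sum(v * (1.0 - 0.5 ** s) * (1.0 - 0.4 * 0.5 ** w)
               for v, s, w in zip(values, strike, weasel))


s_alloc, w_alloc, value = surrogate_search(toy_q, [1, 1], [0, 0], n_strike=4, n_weasel=2)
print(s_alloc, w_alloc, value)
```

The sketch keeps the best integer allocation seen so far rather than the last iterate, which is a convenience for reporting and not part of the method as described.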

Because of the piecewise linear nature of the approximation, the optimal fractional
solution is at an integer value; thus, if the optimization finds the optimal allocation for the
surrogate cost function R, it also finds the optimal allocation for the original
objective function Q [GoCassOO]. Furthermore, the slowly decreasing step
size 1 / k guarantees convergence to a local optimum even if the evaluations of the function Q are
noisy. However, as a gradient descent algorithm, the surrogate optimization method often
converges to a local instead of a global optimum. To overcome this limitation, we take
two steps: First, we initialize the algorithm with the MMR solution described previously.
Second, we also perform several repetitions of the algorithm from randomly selected
assignments, and select the best of the resulting allocations.

4.4  DESIGN  MODEL  APPROACH 

In this section, we develop a design model for the system dynamics, which are described
in substantial detail in Section 3.2. Similar to the hybrid dynamical model, our design model
exploits  the  fact  that  interactions  between  JAO  objects  are  sparse  and  involve  relatively  few 
objects  in  each  event,  in  order  to  achieve  compact  and  efficient  evaluation  methods.  In  the 
following,  we  outline  our  implementation  for  the  JAO  environment. 

The  design  model  is  based  on  a  small  set  of  dynamical  models  and  their  associated 
interaction  dynamics.  These  models  correspond  to  physical  objects,  such  as  air  packages,  threats, 
and  targets.  This  direct  mapping  simplifies  construction  of  the  design  model.  Based  on  the 
physical  objects,  the  design  models  may  be  distilled  from  the  more  detailed  hybrid  dynamical 
model,  as  follows. 

Parallel  composition  of  individual  object  models  provides  a  concise  representation  of  the 
JAO  dynamics  in  the  absence  of  interactions.  When  objects  interact,  e.g.,  threat  and  target 
engagements,  a  product  composition  of  the  associated  objects  concisely  represents  the 
interaction.  To  represent  these  interactions,  a  set  of  composite  models  representing  pairs  of 
interacting  objects  was  developed.  These  models,  based  on  individual  exchanges  in  an 
underlying  Markov  process,  capture  the  complex  interaction  dynamics,  e.g.,  weasel  suppression, 
target  acquisition,  etc.  The  first  of  these  models  is  a  transition  model  between  an  air  package  i 
and an air defense SAM j. Let $\pi_i$ denote the discrete state associated with the air package,
consisting of the number of aircraft of each type which remain alive in the package, and let $s_j$
denote the state of the SAM. In our SAM models, the SAM can be in one of five states, as
described in our hybrid dynamics. Given the hybrid state trajectory of the air package and the
capabilities of the SAM, we compute the transition kernel $P(\pi_i(+), s_j(+) \mid \pi_i(-), s_j(-))$, which is
a matrix indexed by possible package contents into and out of the engagement, and SAM status
into and out of the engagement.

The second model is a transition model between an air package i and a target k that it attacks.
Targets can be in one of two states, alive or dead. As before, these dynamics are lifted from the
detailed hybrid model, and represented as a transition kernel between the joint states,
$P(\pi_i(+), t_k(+) \mid \pi_i(-), t_k(-))$. Other models represent independent transition dynamics in SAM
states, as SAMs turn on and off, or are repaired after incurring damage. By factoring the joint
probabilities at the end of each stage into products of marginal probabilities for each object, one
obtains an efficient prediction of the distribution at the end of a wave of activity, which can be
used to compute the performance statistics associated with any given strategy.
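The factored prediction step can be sketched as follows: the marginals of two interacting objects are combined, pushed through a pairwise transition kernel, and the resulting joint distribution is reduced back to marginals. The dictionary representation, the function name `propagate`, the two-state SAM (the report's SAM model has five states), and all probabilities are assumptions for illustration.

```python
def propagate(pkg_marginal, sam_marginal, kernel):
    """Apply one engagement and return the updated marginals.

    pkg_marginal, sam_marginal: dict state -> probability before the engagement.
    kernel[(pkg_in, sam_in)]: dict (pkg_out, sam_out) -> transition probability.
    """
    joint = {}
    for p_in, p_prob in pkg_marginal.items():
        for s_in, s_prob in sam_marginal.items():
            for out, prob in kernel[(p_in, s_in)].items():
                joint[out] = joint.get(out, 0.0) + p_prob * s_prob * prob
    # Factor the joint distribution back into a product of marginals.
    pkg_out, sam_out = {}, {}
    for (p, s), prob in joint.items():
        pkg_out[p] = pkg_out.get(p, 0.0) + prob
        sam_out[s] = sam_out.get(s, 0.0) + prob
    return pkg_out, sam_out


# Package state = number of aircraft alive (0 or 1); SAM state simplified to alive/dead.
kernel = {
    (1, "alive"): {(1, "dead"): 0.5, (0, "alive"): 0.3, (1, "alive"): 0.2},
    (1, "dead"): {(1, "dead"): 1.0},
    (0, "alive"): {(0, "alive"): 1.0},
    (0, "dead"): {(0, "dead"): 1.0},
}
pkg_after, sam_after = propagate({1: 1.0}, {"alive": 0.8, "dead": 0.2}, kernel)
print(pkg_after, sam_after)
```

Keeping only marginals discards the correlation created by the engagement (here, between package loss and SAM survival), which is the approximation that keeps the state description compact.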

The  above  prediction  approach  is  based  on  propagating  through  a  pre-specified  sequence 
of  interaction  events.  An  important  extension  that  we  considered  in  this  work  is  closed-loop 
prediction,  where  the  particular  sequence  of  interaction  events  depends  on  the  specific  states  that 
are  observed  as  outcomes.  For  instance,  after  a  specific  interaction  between  an  air  package  and  a 
SAM,  the  air  package  may  abort  its  mission  if  the  number  of  surviving  weasel  aircraft  and  strike 
aircraft  falls  below  required  quantities.  Similarly,  the  missions  selected  in  the  second  wave  of  an 
attack  depend  on  the  relative  success  of  the  first  wave  missions  in  eliminating  targets,  and  the 
number  of  aircraft  surviving  the  first  wave. 

We  focus  on  the  problem  of  two-stage  prediction,  where  information  is  collected  at  the 
end  of  a  stage  or  wave,  and  the  next  set  of  missions  is  then  adapted  to  the  results  of  the  first 
wave.  Closed-loop  prediction  depends  on  the  arrival  of  information.  We  assume  that  the  state  at 
the  end  of  the  first  wave  is  observed,  and  that  the  strategy  for  the  second  wave  is  then  computed. 
Our evaluation approach is based on computing an analytical approximation to the distribution of
this state as before, and then sampling this distribution to generate a finite number of representative
scenarios. For each sample scenario, we use a combinatorial algorithm with a single-wave
analytical approximation of the performance criteria to determine both the desired sequence of
missions and their expected performance. The performance achieved for these samples is then
averaged to obtain estimates of the two-wave performance.
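The sampling-based two-wave evaluation can be sketched as follows. The planner passed in as `plan_second_wave` stands for the single-wave combinatorial algorithm; `toy_planner`, the state names, and all numbers are hypothetical and only illustrate the flow of sample, plan, and average.

```python
import random


def sample_scenario(marginals, rng):
    """Draw one end-of-first-wave state from the factored distribution."""
    return {name: rng.choices(list(dist), weights=list(dist.values()))[0]
            for name, dist in marginals.items()}


def two_wave_estimate(marginals, plan_second_wave, n_samples=500, seed=0):
    """Average, over sampled scenarios, the value of the best second-wave plan."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        scenario = sample_scenario(marginals, rng)
        _, value = plan_second_wave(scenario)  # (missions, expected performance)
        total += value
    return total / n_samples


TARGET_VALUES = {"t1": 10.0, "t2": 4.0}


def toy_planner(scenario, p_kill=0.7):
    # Credit already-destroyed targets, then send the surviving aircraft (if any)
    # against the most valuable target still alive.
    destroyed = sum(v for t, v in TARGET_VALUES.items() if scenario[t] == "dead")
    alive = [t for t in TARGET_VALUES if scenario[t] == "alive"]
    if scenario["aircraft"] == 0 or not alive:
        return None, destroyed
    target = max(alive, key=TARGET_VALUES.get)
    return target, destroyed + p_kill * TARGET_VALUES[target]


marginals = {
    "t1": {"alive": 0.4, "dead": 0.6},
    "t2": {"alive": 0.7, "dead": 0.3},
    "aircraft": {0: 0.1, 1: 0.9},
}
estimate = two_wave_estimate(marginals, toy_planner)
certain = two_wave_estimate(
    {"t1": {"alive": 1.0}, "t2": {"dead": 1.0}, "aircraft": {1: 1.0}}, toy_planner)
print(estimate, certain)
```

The second call uses a degenerate distribution, so every sample is the same scenario and the estimate equals that scenario's planned value.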

Another  extension  that  we  considered  was  a  model  of  partial  information  arrival,  where 
ISR  sensors  were  scheduled  over  the  battle  space.  In  this  case,  the  observations  are  perfect,  but 
occur  over  time,  and  are  localized  around  the  ISR  sensors.  The  localized  observations  result  in 
partial  information  regarding  the  battlespace.  These  observations  are  projected  forward  to  the 
current time using the same probabilistic transition models discussed previously. The result is a
probabilistic  state  estimated  at  the  decision  point  from  which  decisions  must  be  made. 

4.5 COMBINATORIAL OPTIMIZATION ALGORITHMS IMPLEMENTED

This  section  describes  how  the  plant  implements  the  different  controller  algorithms 
described  in  Section  4.3,  including  when  and  how  they  are  used.  Further  details  about  each 
algorithm are also given. As will be evident in the sections below, the controllers may be split
up  into  two  distinct  groups  depending  on  their  function.  One  type  is  used  to  allocate  resources 
(strike  and  weasel  aircraft)  at  the  airbase  to  newly  constructed  air  package  objects,  which  the 
controller  outputs.  In  addition  to  configuring  the  air  packages,  the  controller  also  sets  their  initial 
missions.  The  second  type  of  controller  is  used  to  modify  previously  created  air  packages’ 
missions  while  in  flight.  No  controller  has  the  ability  to  reconfigure  air  packages  while  in  flight, 
such  as  moving  assets  between  packages. 

4.5.1  Retasker  using  Combinatorial  Rollout 

This  controller  may  be  called  by  the  plant  when  a  given  TCT  emerges.  It  has  a  very 
specific  role  of  finding  the  air  packages  that  qualify  for  retasking  to  the  TCT,  determining  the 
best  one  (if  any)  to  divert  to  it,  and  modifying  its  mission  accordingly.  In  order  for  an  air 
package  to  qualify  for  retasking,  it  must  meet  the  following  two  criteria:  1)  it  must  be  currently 
flying  ingress  to  a  normal  target,  and  2)  its  ingress  mission  route  must  intersect  a  circle  of  a 
given  radius  around  the  TCT,  as  illustrated  in  Figure  18  below.  The  radius,  or  retask  range,  used 
in  all  experiments  was  50  km.  If  an  air  package  is  tasked  to  an  AOR,  it  automatically  qualifies 
for  retasking  if  the  TCT  is  within  the  same  AOR.  The  reason  for  implementing  a  localized 
retasking,  via  the  intersection  range  and  AOR  groupings,  is  to  avoid  drastic  geographic  changes 
to  an  air  package’s  mission,  which  would  have  a  greater  disturbance  on  the  highly  coupled 
missions  of  the  other  aircraft.  Only  allowing  the  packages  that  were  already  passing  near  the 
TCT  to  divert  to  it  should  minimize  the  effect  on  coupled  activities  of  the  previously  configured 
packages,  such  as  in  the  coordinated  suppression  of  enemy  air  defense.  This  is  very  important 
since  only  one  air  package  is  diverted  to  the  TCT,  and  the  other  packages  are  not  retasked  to 
account  for  the  change  in  the  mission  queue.  The  Retasker  was  implemented  in  this  specific, 
localized  manner  in  order  to  provide  real-time  control  upon  TCT  emergence,  and  also  to  avoid 
the potential snowball effect that allowing other air packages to divert to newly unassigned
targets  (resulting  from  previous  retaskings)  could  have. 
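The qualification test for retasking can be sketched as a segment-to-circle check. The sketch assumes planar coordinates in kilometers, a route given as a list of waypoints, and a dictionary layout for packages and TCTs; it also reads the AOR rule as sufficient on its own. All names are hypothetical; only the 50 km default comes from the text.

```python
import math


def segment_hits_circle(a, b, centre, radius):
    """True if the segment from a to b passes within radius of centre."""
    (ax, ay), (bx, by), (cx, cy) = a, b, centre
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        t = 0.0
    else:
        # Parameter of the closest point on the segment, clamped to [0, 1].
        t = max(0.0, min(1.0, ((cx - ax) * dx + (cy - ay) * dy) / seg_len2))
    return math.hypot(ax + t * dx - cx, ay + t * dy - cy) <= radius


def qualifies_for_retask(pkg, tct, retask_range_km=50.0):
    # A package tasked to an AOR qualifies when the TCT lies in the same AOR.
    if pkg.get("aor") is not None and pkg["aor"] == tct.get("aor"):
        return True
    # Otherwise it must be on ingress and its route must cross the retask circle.
    if pkg["phase"] != "ingress":
        return False
    route = pkg["route"]
    return any(segment_hits_circle(p, q, tct["pos"], retask_range_km)
               for p, q in zip(route, route[1:]))


package = {"phase": "ingress", "route": [(0.0, 0.0), (100.0, 0.0)], "aor": None}
near_tct = {"pos": (50.0, 40.0), "aor": None}
far_tct = {"pos": (50.0, 60.0), "aor": None}
print(qualifies_for_retask(package, near_tct), qualifies_for_retask(package, far_tct))
```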

When using the Retasker, the plant will request guidance immediately following any TCT
emerge event (it is part of the event’s execution logic), passing the controller the name of the
TCT, in addition to the required estimated state of the world. If an air package is already
assigned to the TCT, then the retask call is unnecessary and will terminate. Otherwise, the
Retasker first determines which air packages are valid candidates as explained above. It then
uses the combinatorial rollout algorithm discussed in Section 4.1 to find the best retask
option. This is done by changing one of the valid air package’s missions at a time, and
evaluating the effect on the entire mission queue using the one-wave, one-stage predictor (see
Section 4.6.1). The option that gives the best performance expectation is selected, and if its
predicted value improves that of the original mission queue, the corresponding air package is
retasked by modifying its mission and returning it to the plant as guidance.

Figure 18. TCT Retasking Radius

4.5.2  Aborter  using  Combinatorial  Rollout 

This  specialized  controller  is  used  in  a  variety  of  circumstances  to  decide  whether  one  or 
more  ingress  air  packages  should  abort  their  current  missions  and  return  to  the  airbase.  The 
ability  to  abort  missions  is  useful  to  avoid  attrition  to  air  packages,  especially  in  an  uncertain 
hostile  environment.  Depending  on  the  current  state  of  the  world,  an  air  package  may  be  aborted 
whenever  the  expected  gain  of  prosecuting  its  assigned  target  (and  value  of  supporting  other 
aircraft)  is  outweighed  by  the  potential  for  further  attrition.  This  allows  the  ability  to  save 
resources  that  might  otherwise  be  ineffective  and/or  destroyed. 

The Aborter functions differently depending on how the plant uses it. One way the plant
may  employ  this  controller  is  by  calling  it  automatically  at  periodic  time  intervals.  In  this  case, 
the  Aborter  performs  a  “full  abort,”  giving  each  air  package  the  opportunity  to  abort  its  mission, 
by  using  the  combinatorial  rollout  algorithm  discussed  in  Section  4.1.  This  works  by  aborting 
each  ingress  air  package  one  at  a  time,  evaluating  the  effect  on  the  entire  mission  queue  using  the 
one-wave,  one-stage  predictor  (see  Section  4.6.1).  The  option  that  gives  the  best  performance 
expectation  is  selected,  and  if  its  predicted  value  improves  that  of  the  current  mission  queue,  the 
corresponding  air  package  is  aborted  by  modifying  its  mission  to  return  to  the  airbase 
immediately.  This  process  continues  using  the  remaining  air  packages,  evaluating  each  option  in 
a  mission  queue  that  includes  any  previously  aborted  packages,  until  the  mission  queue  cannot 
be  improved  by  aborting  any  air  packages.  All  aborted  packages  are  then  returned  to  the  plant  as 
guidance. 
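The "full abort" loop can be sketched as follows. The mission queue is represented as a dictionary from package name to mission label, and `predict` stands for the one-wave, one-stage predictor; this representation, the function names, and the numbers in `toy_predict` are assumptions for illustration.

```python
def full_abort(queue, predict):
    """Abort packages one at a time while doing so improves the predicted value.

    queue: dict package name -> 'ingress' or 'abort'.
    predict: callable giving the expected performance of a mission queue.
    """
    queue = dict(queue)
    aborted = []
    current = predict(queue)
    while True:
        best_val, best_pkg = current, None
        for name, mission in queue.items():
            if mission != "ingress":
                continue
            val = predict({**queue, name: "abort"})  # try aborting this package only
            if val > best_val:
                best_val, best_pkg = val, name
        if best_pkg is None:
            return aborted, queue  # no single abort improves the queue
        queue[best_pkg] = "abort"
        aborted.append(best_pkg)
        current = best_val


def toy_predict(queue):
    # Illustrative predictor: expected gain minus expected attrition per ingress
    # package, with B losing some protection if its supporting package A aborts.
    gain = {"A": 2.0, "B": 6.0, "C": 1.0}
    risk = {"A": 6.0, "B": 1.0, "C": 3.0}
    value = sum(gain[n] - risk[n] for n, m in queue.items() if m == "ingress")
    if queue["B"] == "ingress" and queue["A"] == "abort":
        value -= 1.0
    return value


aborted, final_queue = full_abort({"A": "ingress", "B": "ingress", "C": "ingress"}, toy_predict)
print(aborted, final_queue)  # ['A', 'C'] {'A': 'abort', 'B': 'ingress', 'C': 'abort'}
```

In the toy data the predicted value rises from -1 to 2 when A aborts and to 4 when C then aborts; aborting B as well would lower it to 0, so the loop stops.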

Another  way  the  plant  uses  the  Aborter  is  when  a  SAMSite  emerges  from  an  unknown 
(hiding),  inactive,  or  repairing  state.  If  a  SAMSite  activates  in  response  to  the  intersection  of  an 
air package with its range of lethality (i.e., missile range) in order to engage the package, the
Aborter  is  called  after  the  engagement  and  any  possible  post-engagement  SAMSite  transitions 
occur.  This  is  useful  for  responding  to  an  uncertain  engagement,  which  may  have  resulted  in  a 
loss  of  aircraft  that  could  compromise  the  success  of  both  the  attritted  air  package’s  mission  and 
any  other  missions  with  which  it  is  coupled.  In  this  case,  the  Aborter  first  decides  whether  or  not 
the  just-engaged  air  package  should  abort  its  mission  by  comparing  the  respective  performance 
expectations using the one-wave, one-stage predictor. If so, a full abort is performed (as
explained  above)  to  account  for  the  potential  effect  on  any  other  air  packages.  The  initial 
fixation  on  the  air  package  involved  in  the  engagement  is  done  to  speed  up  the  computation  time, 
and  also  as  an  attempt  to  localize  the  abort. 

If the Aborter is called in response to a SAMSite emergence that occurred stochastically,
not due to a specific simulation event, the Aborter functions slightly differently. It first decides
if any ingress air packages currently routed through the emerged SAMSite’s range of lethality
should be aborted. This is done in much the same way as a full abort, except the abort candidates are
limited to this subset of air packages. If any of these packages were retasked to the airbase, a full
abort is then performed on the remaining air packages to account for the possible coupling across
missions. As above, the first step is performed both to minimize computation time and
to localize the Aborter’s effect to just those air packages involved with the specific SAMSite.

4.5.3 Target Tasking using Maximum Marginal Return

The MMR controller is used to construct and configure new air packages from resources
at  the  airbase,  and  task  them  to  targets.  First,  the  controller  must  create  new  air  packages,  one 
for  each  unassigned  target  in  the  estimated  world  state  provided  by  the  plant.  Each  package  is 
initialized  with  zero  weasel  and  one  strike  aircraft.  As  detailed  in  Section  4.3.1,  the  next  step  is 
to  allocate  weasel  aircraft  to  the  set  of  packages.  This  is  done  by  temporarily  incrementing  the 
number  of  weasels  in  each  package  one  at  a  time,  evaluating  each  option  within  the  current 
mission  queue  configuration  using  any  of  the  possible  predictors  (see  Section  4.6),  and 
permanently  adding  the  weasel  aircraft  to  the  package  with  the  best  performance  expectation. 
This  process  continues  until  either  the  airbase  runs  out  of  weasel  aircraft  to  allocate,  or  adding 
weasels  to  any  package  does  not  improve  the  mission  queue’s  predicted  value.  The  algorithm 
may  be  tuned  by  changing  the  maximum  number  of  weasel  aircraft  allowed  in  an  air  package, 
changing  the  increment  used  when  allocating  weasels  (how  many  to  allocate  at  a  time),  or  even 
by  stepping  multiple  increments  ahead  when  searching  the  control  space  for  the  best  predicted 
value.  To  clarify  the  latter  option,  take  the  example  of  allocating  two  weasels  at  a  time.  If  we 
allow  three  steps  into  the  control  space,  then  the  number  of  mission  queues  to  evaluate  would 
equal  the  number  of  air  packages  times  three,  each  being  incremented  by  two,  four,  or  six  weasel 
aircraft.  The  one  option  that  gives  the  best  performance  expectation  would  have  its 
corresponding  air  package’s  weasel  count  incremented  by  two  weasels,  regardless  of  how  many 
were  allocated  when  testing  that  option. 

Next  the  strike  aircraft  are  allocated  using  a  similar  process.  Initially,  the  previous 
weasel  assignment  is  untouched,  but  the  strike  aircraft  in  each  package  are  cleared  to  zero.  The 
strike  aircraft  at  the  base  are  then  allocated  incrementally  just  as  the  weasels  were  above.  The 
only  difference  is  in  how  the  control  space  may  be  “stepped  into.”  If  strike  aircraft  were  used  in 
the  previous  weasel  allocation  example,  options  with  four  or  six  aircraft  would  only  be  evaluated 
if  no  options  with  two  or  four  aircraft,  respectively,  improved  the  mission  queue’s  expected 
performance.  These  tunings  of  the  weasel  and  strike  assignment  were  used  to  overcome  specific 
problems  where  the  MMR  would  exit  prematurely  before  allocating  many  of  the  aircraft, 
depending  on  how  the  algorithm  was  being  used.  The  weasel  and  strike  allocation  cycles  may  be 
repeated as desired, although one or two iterations of the algorithm are usually sufficient. By
clearing  the  particular  type  of  aircraft  being  assigned  from  the  current  mission  queue,  it  may  be 
possible  to  improve  that  aircraft’s  allocation  using  the  current  allocation  of  the  other  types  of 
aircraft. 
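The two ways of "stepping into" the control space described above can be sketched in one helper. With `staged=False` every depth is evaluated for every package (the weasel rule); with `staged=True` deeper options are evaluated only if no shallower option improves the mission queue (the strike rule). The function name, the package layout, and the threshold objective `toy_q` are hypothetical, and resource availability checks are omitted for brevity.

```python
def best_step(q, packages, kind, increment, steps, staged):
    """Return the index of the package to increment by `increment`, or None.

    Options add increment, 2 * increment, ..., steps * increment aircraft of
    type `kind`; only one increment is committed by the caller.
    """
    base = q(packages)
    depths = list(range(1, steps + 1))
    # Staged search tries one depth at a time; unstaged tries all depths together.
    rounds = [[d] for d in depths] if staged else [depths]
    for depth_set in rounds:
        best_val, best_i = base, None
        for i, pkg in enumerate(packages):
            for d in depth_set:
                pkg[kind] += d * increment
                val = q(packages)
                pkg[kind] -= d * increment
                if val > best_val:
                    best_val, best_i = val, i
        if best_i is not None:
            return best_i
    return None


def toy_q(packages, values=(5.0, 10.0)):
    # Illustrative non-concave objective: a package pays off only with 4 or more weasels.
    return sum(v for v, p in zip(values, packages) if p["weasel"] >= 4)


pkgs = [{"strike": 1, "weasel": 0}, {"strike": 1, "weasel": 0}]
one_step = best_step(toy_q, pkgs, "weasel", increment=2, steps=1, staged=False)
three_steps = best_step(toy_q, pkgs, "weasel", increment=2, steps=3, staged=False)
staged_choice = best_step(toy_q, pkgs, "weasel", increment=2, steps=3, staged=True)
print(one_step, three_steps, staged_choice)  # None 1 1
```

With a single step the search sees no improvement and would exit prematurely, which is the failure the tuning is meant to avoid; looking further ahead reveals that package 1 is worth building up.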

Like any other controller, the plant may employ the MMR whenever it deems it useful.
In the MMR’s case, this would be any time air packages may be formed and tasked to targets,
whether the resources are at an airbase or an AOR. Commonly, the MMR would be used at the
start of the simulation and at the end of each subsequent wave of attack, that is, when all air
packages return to base after executing their missions. Alternatively, the MMR could be called
when each air package returns to base, when a “gorilla” air package arrives at its AOR waypoint, or
even at periodic time intervals.

4.5.4  AOR  Tasking  using  Maximum  Marginal  Return 

This controller is a variation of the MMR algorithm used to task air packages to targets.
Its main function is still to allocate weasel and strike aircraft to air packages. But instead of
tasking  them  to  targets,  their  missions  are  to  AOR  waypoints,  at  which  a  target  tasking  controller 
is  used.  There  are  two  ways  of  using  this  controller,  depending  on  how  many  stages  the  user 
wants  to  model.  The  traditional,  more  accurate  way  of  using  the  controller  is  to  form  a  gorilla 
package  for  each  AOR,  and  allocate  resources  just  like  in  the  target  tasking  MMR  (see  Section 
4.5.3),  but  using  a  two-stage  predictor  to  evaluate  mission  queues  of  gorilla  air  packages  (see 
Section  4.6.5).  The  first  stage  represents  the  gorilla  packages  flying  to  their  AOR  waypoints, 
and  the  second  stage  is  the  tasking  of  smaller  air  packages  from  the  AORs  to  targets  and  back  to 
the  airbase.  The  other  way  of  using  this  controller  is  in  a  one-stage  context.  This  case  works  just 
like  the  target  tasking  MMR,  actually  configuring  air  packages  tasked  to  unassigned  targets.  The 
only  difference  is  that  they  are  routed  through  the  AOR  waypoint  instead  of  directly  to  the  target. 
After  the  aircraft  allocation  is  complete,  the  aircraft  in  all  air  packages  tasked  to  targets  in  each 
AOR  are  conglomerated  into  larger  gorilla  air  packages  tasked  to  the  respective  AORs.  The  only 
reason  to  use  this  one-stage  method  in  lieu  of  using  the  better  two-stage  predictor  is  to  save 
computation  time.  This  controller  is  implemented  by  the  plant  at  the  same  times  as  the  target 
tasking  MMR,  except  it  can  only  be  used  to  task  aircraft  at  the  airbase,  not  when  they  arrive  at  an 
AOR  waypoint. 
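The conglomeration step of the one-stage variant can be sketched as a simple grouping. The package layout, the `target_aor` lookup, and the names used are hypothetical.

```python
from collections import defaultdict


def conglomerate(packages, target_aor):
    """Merge per-target packages into one gorilla package per AOR.

    packages: list of dicts with 'target', 'strike' and 'weasel' entries.
    target_aor: dict mapping each target name to the AOR that contains it.
    """
    gorillas = defaultdict(lambda: {"strike": 0, "weasel": 0, "targets": []})
    for pkg in packages:
        gorilla = gorillas[target_aor[pkg["target"]]]
        gorilla["strike"] += pkg["strike"]
        gorilla["weasel"] += pkg["weasel"]
        gorilla["targets"].append(pkg["target"])
    return dict(gorillas)


target_aor = {"bridge": "north", "radar": "north", "depot": "south"}
packages = [
    {"target": "bridge", "strike": 2, "weasel": 1},
    {"target": "radar", "strike": 1, "weasel": 2},
    {"target": "depot", "strike": 3, "weasel": 0},
]
gorilla_packages = conglomerate(packages, target_aor)
print(gorilla_packages)
```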

4.5.5 Target Tasking using Surrogate Method

The surrogate method is another controller used to construct and configure new air
packages from resources at the airbase, and task them to targets. It is implemented by the plant
in  the  same  way  as  the  MMR  (see  the  end  of  Section  4.5.3).  First,  the  controller  must  create 
unconfigured  air  packages,  one  for  each  unassigned  target  in  the  estimated  world  state  provided 
by  the  plant.  A  gradient-based  approach  then  searches  the  control  space  to  allocate  strike  and 
weasel  aircraft  to  the  different  packages,  as  detailed  in  Section  4.3.3.  One  addition  to  the 
algorithm’s  description  above  is  an  option  that  was  added  to  speed  it  up.  Rather  than  restarting 
the  algorithm  multiple  times  with  random  air  package  configurations  (or  target  assignments),  it 
was  useful  to  seed  the  surrogate  with  the  MMR’s  solution,  and  then  iterate  over  that  to  try  to 
improve  upon  it  by  moving  aircraft  between  the  packages. 

4.5.6  AOR  Tasking  using  Surrogate  Method 

This  controller  is  a  variation  of  the  surrogate  method  used  to  task  air  packages  to  targets 
(see  Section  4.3.3).  Its  main  function  is  still  to  allocate  weasel  and  strike  aircraft  to  air  packages. 
But  instead  of  tasking  them  to  targets,  their  missions  are  to  AOR  waypoints,  at  which  a  target 
tasking  controller  is  used.  After  forming  one  “gorilla”  package  for  each  AOR,  resources  are 
allocated  using  the  same  gradient-based  approach  detailed  in  Section  4.3.3,  but  using  a  two-stage 
predictor  to  evaluate  mission  queues  of  gorilla  air  packages  (see  Section  4.6.5).  The  first  stage 
represents  the  gorilla  packages  flying  to  their  AOR  waypoints,  and  the  second  stage  is  the 
tasking  of  smaller  air  packages  from  the  AORs  to  targets  and  back  to  the  airbase.  This  controller 
is  implemented  by  the  plant  at  the  same  times  as  the  target  tasking  surrogate  method,  except  it 
can  only  be  used  to  task  aircraft  at  the  airbase,  not  when  they  arrive  at  an  AOR  waypoint. 

4.5.7  Target  Tasking  using  Combinatorial  Rollout 

The  combinatorial  rollout  controller  is  used  to  construct  and  configure  new  air  packages 
from  resources  at  the  airbase,  and  task  them  to  targets,  based  on  the  algorithm  in  Section  4.3.2. 
First,  the  controller  must  create  new  air  packages,  one  for  each  unassigned  target  in  the  estimated 
world  state  provided  by  the  plant.  The  control  space  is  then  directly  enumerated  with  every 
possible  air  package  configuration  to  each  target,  taking  into  consideration  any  constraints  on 
maximum numbers of aircraft per package and incremental assignments of aircraft (i.e., two
weasel aircraft at a time). Each option is selected and added to a mission queue to evaluate. Any
resources  remaining  at  the  airbase  (not  in  the  mission  queue)  are  used  to  form  air  packages  to 
task  to  unassigned  targets,  using  a  greedy  heuristic.  This  mission  queue  is  then  evaluated  using 
any  of  the  possible  predictors  (see  Section  4.6).  The  air  package  option  whose  heuristically 
completed  mission  queue  gives  the  best  performance  expectation  is  then  added  to  the  final 
mission  queue.  Any  other  options  in  the  enumeration  tasked  to  that  package’s  target  are 
removed  from  the  possible  candidates.  In  successive  iterations  of  this  option  selection,  the 
mission  queue  evaluated  includes  not  only  the  new  candidate,  but  also  any  previously  selected 
packages,  which  decreases  the  resources  available  to  the  heuristic  when  filling  out  the  rest  of  the 
mission  queue.  This  process  continues  until  either  all  resources  have  been  exhausted,  or  the 
current  mission  queue’s  predicted  value  cannot  be  improved  by  adding  any  option  to  it. 
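
The option-selection loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implemented controller: the predictor `predict` (expected destroyed target value with a fixed per-aircraft kill probability `pk`), the base heuristic `greedy_complete` (one aircraft per remaining target, highest value first), and all names are hypothetical stand-ins for the design model of Section 4.6 and the greedy heuristic mentioned in the text.

```python
def predict(queue, values, pk):
    """Toy single-stage predictor (assumption): expected destroyed target value."""
    return sum(values[t] * (1.0 - (1.0 - pk) ** n) for t, n in queue)


def greedy_complete(queue, values, aircraft_left):
    """Toy base heuristic (assumption): one aircraft per untasked target, best first."""
    tasked = {t for t, _ in queue}
    completed = list(queue)
    for t in sorted(values, key=values.get, reverse=True):
        if aircraft_left == 0:
            break
        if t not in tasked:
            completed.append((t, 1))
            aircraft_left -= 1
    return completed


def rollout(values, n_aircraft, max_per_package, pk=0.5):
    """Build a mission queue of (target, package size) options one at a time."""
    queue = []
    # Enumerate every package size for every target, subject to the size limit.
    candidates = [(t, n) for t in values for n in range(1, max_per_package + 1)]
    while candidates:
        used = sum(n for _, n in queue)
        best_option, best_score = None, float("-inf")
        for t, n in candidates:
            if used + n > n_aircraft:
                continue  # not enough aircraft left at the airbase
            # Fill out the rest of the queue heuristically, then evaluate it.
            completed = greedy_complete(queue + [(t, n)], values, n_aircraft - used - n)
            score = predict(completed, values, pk)
            if score > best_score:
                best_option, best_score = (t, n), score
        if best_option is None:
            break  # resources exhausted
        if predict(queue + [best_option], values, pk) <= predict(queue, values, pk):
            break  # no option improves the current queue
        queue.append(best_option)
        # Drop all other options tasked to the same target.
        candidates = [(t, n) for t, n in candidates if t != best_option[0]]
    return queue
```

With two targets valued 10 and 4, three aircraft, and at most two aircraft per package, the sketch first commits two aircraft to the higher-value target and then one to the other.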

This  controller  is  implemented  by  the  plant  in  the  same  way  as  the  other  target  tasking 
controllers  (see  the  end  of  Section  4.5.3),  except  it  can  only  be  used  to  task  aircraft  at  the  airbase, 
not  when  they  arrive  at  an  AOR  waypoint. 

4.6  DESIGN  MODELS  IMPLEMENTED 

The control architecture described in Section 4.2.2 executes a measured response, i.e.,
trading off computation and performance in response to specific “trigger” events. In order to achieve
“closed-loop”  behaviors,  control  decisions  must  explicitly  account  for  subsequent  decisions.  In 
this  section,  we  highlight  several  multi-stage  controller  designs,  typically  for  the  allocation  of 
base  resources,  which  account  for  subsequent  decisions.  In  each  case,  what  distinguishes  these 
controllers  is  the  associated  prediction  model. 

4.6.1 1-Stage/1-Wave Model

The 1-Stage/1-Wave controller determines a set of missions over a single wave. Each
wave  begins  when  the  air  packages  are  launched  from  base  and  ends  when  they  return. 

This  controller  solves  the  combinatorial  optimization  problem  discussed  above  (Section 
4.3),  and  therefore  any  of  the  methods  may  be  used.  Each  of  these  methods  evaluates  candidate 
control options using our design model (Section 4.4) over a single wave. This is depicted in
Figure 19.

Figure 19 1-Stage Prediction

This  is  the  baseline  controller,  which  is  also  the  basis  for  subsequent  controller 
implementations. Because of its performance relative to its computational cost, the MMR algorithm
is better suited to the 1-Stage/1-Wave case.

4.6.2 2-Stage/1-Wave Model with Retasking

This  controller  also  considers  a  1-wave  horizon,  but  accounts  for  potential  retasking  of  air 
packages  in  response  to  emerging  TCTs.  TCT  emergence  is  a  random  event  during  the  execution 
of  a  wave.  When  a  TCT  emerges,  divertable  air  packages  (i.e.,  loaded  and  within  range)  are 
considered  for  retasking,  and  the  best  option  is  selected  (see  Section  4.5.1).  Therefore,  we  expect 
the  controller  to  assign  missions  from  base  that  anticipate  potential  retasking.  This  type  of 
proactive  control  significantly  increases  computational  complexity.  The  difficulty  is  in  predicting 
the  value  over  2  stages,  i.e.,  through  an  intermediate  decision  point  that  corresponds  to  a 
retasking decision. We illustrate this in Figure 20. A set of missions is launched from base.
When a TCT emerges, a control decision is required, which in general will map the current state
$x_1$ at the time of the TCT emergence to an appropriate retasking control $u_1$. Given a specific state
$x_1$ and control $u_1$, we are able to complete the 1-wave prediction. This is difficult even for a
single TCT since the emergence of the TCT occurs at a random point in time and the control
decision is state dependent.

[Figure 20 timeline: Wave Launch (0th Stage), TCT Emerge (1st Stage), Wave Return-to-Base (2nd Stage)]


Figure  20  2-Stage  Retasking  Prediction 

To simplify this problem, we assume that retasking depends only on the arrival probability
of  the  TCT  (rather  than  the  precise  state  of  all  air  packages,  targets,  and  threats)  and  that  the 
retask  decisions  are  synchronized  across  missions.  This  is  illustrated  in  Figure  21,  where  the  first 
air  package  to  enter  a  TCT  divert  range  triggers  a  divert  decision  for  all  air  packages. 

With  these  assumptions,  we  are  able  to  use  the  same  combinatorial  algorithms  with  a 
specialized predictor. Consider Figure 20: our baseline predictor is used up until the retasking
point  (TCT  Emerge  in  the  figure).  At  this  point  the  retasking  controller  is  used  to  identify  a 
candidate  retask  option,  which  is  implemented  probabilistically  depending  on  the  probability  that 
the  TCT  has  emerged.  Therefore,  for  a  given  retask  option,  the  predicted  value  is  given  by  a 
weighted average of either retasking or not retasking the individual air packages. Note that the
retask control algorithm needs to be solved for each evaluation of this 2-stage controller.
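
The weighted average used by this predictor can be written out as a short sketch. The field names (`pre`, `retask`, `original`) and the per-package value decomposition are illustrative assumptions; in the report the individual values come from the baseline predictor and the retasking controller.

```python
def retask_hedged_value(packages, p_tct):
    """Predicted value of a mission set when a TCT emerges with probability p_tct.

    Each package is a dict (hypothetical layout) holding:
      pre      - value accumulated up to the retasking point
      retask   - value of the remainder of the wave if the package is retasked
      original - value of the remainder if it continues to its original target
    A package that cannot be diverted simply has retask == original.
    """
    total = 0.0
    for pkg in packages:
        total += pkg["pre"] + p_tct * pkg["retask"] + (1.0 - p_tct) * pkg["original"]
    return total
```

A mission that could be diverted to a high-value TCT sees its predicted value rise with `p_tct`, which is the hedging effect described in the following paragraph.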

This  controller  generates  an  open  loop  control  decision  that  hedges  against  the 
probabilistic  arrival  of  TCTs,  effectively  inflating  the  value  of  missions  which  could  be  diverted 
while encouraging an allocation of resources appropriate for prosecution of potential TCTs as
well as original targets.

4.6.3 2-Stage/2-Wave Model

In this case, we consider two waves without retasking. The controller determines a set of
missions for the first wave, while explicitly accounting for second-wave control decisions.
Selection of the first-wave control uses the same combinatorial algorithms as in the previous
cases; however, a prediction is required that considers two waves. This is depicted in Figure 22.

The set of missions is launched from base. When the air packages return to base, a
control decision is required, which in general will map the current state $x_1$ at the beginning of the
second wave to an appropriate second-wave control (i.e., set of missions) $u_1$. Given a specific
state $x_1$ and control $u_1$, we are able to compute the 2-wave prediction as a function of $x_1$.

We use our baseline predictor for the first wave, which generates a probabilistic
distribution $f(x_1|x_0,u_0)$ over the state $x_1$. The control $u_1 = \mu(x_1)$ is determined from a given state
using one of the same combinatorial algorithms that were available for the first stage. To
evaluate the second stage, we compute the expected value $J[f(x_2|x_0,u_0)]$ at the end of the
second stage by averaging over the state $x_1$ at the end of the first stage. To compute this
expectation, we enumerate the feasible states $x_1$ at the end of the first stage, determine the second
wave control $u_1 = \mu(x_1)$, and compute the value $J[f(x_2|x_1,u_1,x_0,u_0)]$ for the given state $x_1$. The
sum of these values, weighted by the probability of the associated state $x_1$, provides the second
stage value $J[f(x_2|x_0,u_0)]$. However, this approach is only feasible when the state space of $x_1$ is
small. In the following, we discuss several approximations of the second stage evaluation.
Specifically, we consider random sampling, certainty equivalence approximations, and an open-loop
approximation.
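
The exact second-stage evaluation by enumeration can be sketched as below. The dictionary representation of the first-wave distribution and the callables `policy` (standing in for the second-wave controller) and `stage_value` (standing in for the design model) are assumptions made for illustration.

```python
def expected_second_stage(state_probs, policy, stage_value):
    """Exact second-stage value by enumerating feasible end-of-first-wave states.

    state_probs : dict mapping each feasible state x1 to its probability
    policy      : callable x1 -> second-wave control u1 (hypothetical stand-in)
    stage_value : callable (x1, u1) -> value of the second stage (stand-in)
    """
    total = 0.0
    for x1, prob in state_probs.items():
        u1 = policy(x1)                      # second-wave control for this state
        total += prob * stage_value(x1, u1)  # weight by probability of the state
    return total
```

The loop runs the second-wave controller once per feasible state, which is why the approach is practical only when the state space is small.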


Figure  22  2-Stage/2-Wave  Predictor 

4.6.3.1 Random Sampling

In  Figure  23,  we  illustrate  the  random  sampling  approach  used  to  predict  the  performance 
of  the  second  wave.  Our  design  model  is  used  to  predict  the  value  of  the  first  wave,  and  it  also 
provides the (approximate) distribution over the state $x_1$ at the end of the first wave.

In this approach, Monte Carlo integration is used to estimate the second stage performance
$J[f(x_2|x_0,u_0)]$. A state $x_1$ is selected randomly according to the known distribution $f(x_1|x_0,u_0)$
at the end of the first wave. The control $u_1 = \mu(x_1)$ is determined based on that state $x_1$.
Given the state $x_1$ and the control $u_1$, our design model is used to compute the value of the
second stage $J[f(x_2|x_1,u_1,x_0,u_0)]$ given the intermediate state $x_1$, from which we estimate the
second stage performance

$$J[f(x_2|x_0,u_0)] \approx \frac{1}{N_{samples}} \sum_{n=1}^{N_{samples}} J[f(x_2|x_1^{(n)},u_1^{(n)},x_0,u_0)]$$


Figure  23  Random  Sampling  for  2-Stage/2-Wave  Prediction 

Using the MMR algorithm, this approach requires O(…) single stage evaluations using our design
model, where $N_{samples}$ is the number of Monte Carlo samples, $N_T$ is the number of target
locations, $N_S$ is the total number of strike aircraft, and $N_W$ is the total number of
weasel aircraft. Methods such as importance sampling could theoretically improve performance.
However, preliminary experiments demonstrated only a minimal improvement in computation
with comparable performance.
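
A minimal sketch of the Monte Carlo estimate is given below, assuming the first-wave distribution is available as a dictionary of state probabilities. The callables `policy` and `stage_value` are hypothetical stand-ins for the second-wave controller and the design model; plain sampling is used, not importance sampling.

```python
import random


def monte_carlo_second_stage(state_probs, policy, stage_value, n_samples, seed=0):
    """Estimate the second-stage value by sampling end-of-first-wave states."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    states = list(state_probs)
    weights = [state_probs[x] for x in states]
    # Draw states x1 according to the first-wave distribution.
    draws = rng.choices(states, weights=weights, k=n_samples)
    # Run the controller and the single-stage model once per sample, then average.
    return sum(stage_value(x1, policy(x1)) for x1 in draws) / n_samples
```

Each sample costs one run of the second-wave controller, so the number of samples trades estimation accuracy directly against computation.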

4.6.3.2 Certainty Equivalent Approximation

In  Figure  24,  we  illustrate  a  certainty  equivalent  approach  used  to  predict  the 
performance  of  the  second  wave.  Our  design  model  is  used  to  predict  the  value  of  the  first  wave 
and the associated (approximate) distribution $f(x_1|x_0,u_0)$ over the state $x_1$ at the end of the first
wave. In this approach, a certainty equivalent state $\bar{x}_1$ is selected to represent the entire
distribution. In our experiments, we consider the mean and mode of the distribution. Given the
certainty equivalent state $\bar{x}_1$, a control $u_1 = \mu(\bar{x}_1)$ is determined using the MMR algorithm.

Given the state $\bar{x}_1$ and the control $u_1$, our design model is used to compute the value of the
second stage, which is our estimate of the overall second stage performance

$$J[f(x_2|x_0,u_0)] \approx J[f(x_2|\bar{x}_1,u_1,x_0,u_0)]$$


Figure  24  Certainty  Equivalent  Approximation  for  2-Stage/2-Wave  Prediction 

Using the MMR algorithm, this approach requires O(…) single stage evaluations using our design
model. Other certainty equivalent states could also be used in this context.
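
The certainty equivalent evaluation can be sketched as follows, using a scalar state (for example, a count of surviving aircraft) so that the mean is well defined. The scalar state, the `kind` switch, and the callables `policy` and `stage_value` are assumptions for illustration only.

```python
def certainty_equivalent_value(state_probs, policy, stage_value, kind="mean"):
    """Evaluate the second stage at a single representative state.

    state_probs : dict mapping scalar states x1 to probabilities
    kind        : "mean" or "mode" of the distribution
    """
    if kind == "mode":
        x_bar = max(state_probs, key=state_probs.get)  # most likely state
    else:
        # The mean may be fractional; the stand-in model must accept that.
        x_bar = sum(x * p for x, p in state_probs.items())
    u1 = policy(x_bar)  # one controller run instead of one per state
    return stage_value(x_bar, u1)
```

Only one controller run and one model evaluation are needed, at the cost of ignoring the spread of the distribution.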

4.6.3.3 Aggregate Open-Loop Approximation

In  Figure  25,  we  illustrate  an  aggregate  open-loop  approximation  that  is  used  to  predict 
performance  in  the  second  wave.  In  this  case,  we  augment  the  single  stage  controller  by  doubling 
the  number  of  strike  aircraft  that  are  available.  The  additional  aircraft  represent  the  reuse  of 
aircraft in the second stage. Given the increased resources, we determine the initial control $u_0$.
Due to the additional resources, this control is infeasible, i.e., there are not enough strike aircraft
at base to perform all the missions. Therefore, we prune the control decision to make it feasible.
Two methods are used. First, we truncate the number of missions (retaining the highest value
missions) and reassign weasel aircraft. An alternate approach sends all the missions with half the
strike aircraft. Our design model is used to compute the value of the first wave
$J[f(x_1|x_0,u_0)]$ with the additional aircraft. The additional aircraft provide a
representation of the two-wave performance.



Figure  25  Aggregate  Open-loop  Approximation  for  2-Stage/2-Wave  Prediction 
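
Using the MMR algorithm, this approach requires $O((\ldots + 1)(2N_S + N_W))$ single stage
evaluations using our design model, where $N_S$ is the total number of strike aircraft and $N_W$ is
the total number of weasel aircraft. Similar open-loop approximations could also be used in this
context.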
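
The two pruning methods described in this section can be sketched as below. The mission layout (dicts with `target`, `strike`, and `value` keys), the skip-and-continue truncation rule, and the rounding rule for halving are illustrative assumptions; the reassignment of weasel aircraft after truncation is not modeled.

```python
def prune_truncate(missions, n_strike):
    """Keep the highest-value missions that fit within the real strike inventory."""
    kept, used = [], 0
    for m in sorted(missions, key=lambda m: m["value"], reverse=True):
        if used + m["strike"] <= n_strike:
            kept.append(m)
            used += m["strike"]
    return kept


def prune_halve(missions):
    """Fly every mission with half its planned strike aircraft (rounded down).

    Missions that would be left with no strike aircraft are dropped (assumption).
    """
    halved = []
    for m in missions:
        n = m["strike"] // 2
        if n > 0:
            halved.append(dict(m, strike=n))
    return halved
```

Both functions take a plan built with a doubled strike inventory and return a plan that fits the aircraft actually at base.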


4.6.4  4-Stage/2-Wave  Model  with  Retasking 

This  controller  combines  the  two  previous  controllers  to  further  extend  the  control 
horizon.  We  consider  two  waves  with  retasking.  The  controller  determines  a  set  of  missions  for 
the  first  wave,  while  explicitly  accounting  for  retasking  in  the  first  wave,  a  second-wave  control 
decision  and  retasking  in  the  second  wave.  Selection  of  the  first-wave  control  uses  the  same 
combinatorial  algorithms,  however  a  prediction  is  required  that  considers  two  waves  with 
retasking.  Two  of  the  four  stages  are  depicted  in  Figure  26. 

We use our retasking predictor (Figure 20) for the first wave, which results in a
probabilistic distribution over the state $x_2$. The control $u_2 = \mu(x_2)$ is determined from a given
state using the same combinatorial algorithms that were used for a single stage. To evaluate the
second stage, we average performance, again using the retasking predictor, over the state $x_2$.
This is implemented using one of the approximation techniques discussed in the previous
section, i.e., random sampling, certainty equivalence approximations, and an open-loop
approximation.


4.6.5 2-Stage/1-Wave Model with AOR Tasking

Given a hierarchical decomposition of the JAO environment, we formulate an alternate
2-stage problem, which first allocates “gorilla” air packages to specific Areas of Responsibility
(AORs), and then upon arrival at the AOR, tasks regular air packages to specific targets. This
effectively  decomposes  the  1-wave  problem  into  smaller  sub-problems  associated  with 
individual  AORs.  This  begins  to  address  scalability;  sophisticated  algorithms  may  be  scaled  to 
realistically  sized  problems  or  efficient  algorithms  may  be  scaled  to  larger  problems.  In  this  case, 
the  base  controller  determines  a  set  of  missions  to  AOR  points,  while  explicitly  accounting  for 
detailed  tasking  that  occurs  at  the  AOR  points.  We  expect  that  missions  from  base  will  be 
constructed  to  provide  sufficient  aircraft  at  each  AOR  point  to  perform  local  tasking.  This 
problem  is  complicated  by  the  potential  coupling  associated  with  air  defenses,  which  may  occur 
on ingress, egress, and among AORs. We explicitly account for the ingress coupling, but to
simplify this problem we neglect egress coupling (i.e., all air packages assume full responsibility
for  air  defenses  encountered  during  egress)  and  the  coupling  among  AORs  (similarly,  each  AOR 
assumes  full  responsibility  for  air  defenses  in  the  region). 

The set of missions is launched from base to specific AORs. When the air packages
arrive at the AOR points, a control decision is required, which in general will map the current
state $x_1$ to an appropriate control (i.e., set of missions within an AOR) $u_1 = \mu(x_1)$. At each AOR
point, one of the combinatorial algorithms is used with a single stage predictor modified to
account for the fact that aircraft do not begin at base and that only a subset of targets is available.
Given a specific state $x_1$ and control $u_1$, we are able to complete the 2-stage AOR prediction.

Neglecting coupling among AORs and that due to threats during egress, the base
controller uses one of the combinatorial algorithms to assign aircraft to AOR points. The predictor
is modified to account for the subsequent tasking of aircraft to targets when they arrive at an
AOR. This is illustrated in Figure 27. Our baseline predictor is used to determine the value (lost)
during ingress to the AOR points, as well as the probabilistic distribution $f(x_1|x_0,u_0)$ over the
state $x_1$ upon arrival at the AOR points. For a given state, we are able to determine the control
$u_1 = \mu(x_1)$, which assigns aircraft to specific targets at the AOR point. To evaluate the second
stage in each AOR, we average the performance over the state $x_1$, similar to the 2-stage/2-wave
case. The performance from each AOR is then combined to form the overall performance of the
second stage. In the following, we discuss several approximations used to evaluate the second
stage in each AOR. Specifically, we consider random sampling and certainty equivalence
approximations as before, as well as a partial open-loop approximation.

Figure 27 2-Stage AOR Predictor

4.6.5.1 Random Sampling

In Figure 28, we illustrate the random sampling approach used to predict the performance
within each AOR. Our design model is used to predict the value during ingress to the AOR and provides
the (approximate) distribution $f(x_1|x_0,u_0)$ over the state $x_1$. At this point, we assume that each
AOR is independent. For each AOR, Monte Carlo integration is used to estimate the
performance $J_i[f(x_2|x_0,u_0)]$. A state $x_1$ is selected randomly according to the known distribution
$f(x_1|x_0,u_0)$ in each AOR. The control $u_1 = \mu_{MMR}(x_1)$ is determined based on that state $x_1$. Given
the state $x_1$ and the control $u_1$, our design model is used to compute the value in each AOR,
$J_i[f(x_2|x_1,u_1,x_0,u_0)]$, given the intermediate state $x_1$, from which we estimate the second stage
performance in AOR $i$.

$$J_i[f(x_2|x_0,u_0)] \approx \frac{1}{N_{samples}} \sum_{n=1}^{N_{samples}} J_i[f(x_2|x_1^{(n)},u_1^{(n)},x_0,u_0)]$$

The  overall  performance  of  the  second  stage  is  determined  by  combining  the  value  of  the 
individual AORs.

Figure 28 Random Sampling for AOR Prediction

Using the Surrogate Method as a base controller and the MMR as an AOR controller, this
approach requires $O(N_{types} \cdot N_{destinations} \cdot N_{iterations} \cdot N_{samples} \cdot N_{AOR} \cdot MMR_{AOR})$ single stage
evaluations using our design model. $N_{types}$ is the number of asset types, i.e., 2 in this case.
$N_{destinations}$ is the number of destinations assignable from base. $N_{iterations}$ is the number of iterations
the Surrogate Method is allowed in order to converge. $N_{AOR}$ is the number of AORs. $MMR_{AOR}$ is the
computational complexity of the MMR assignment at each AOR point.

4.6.5.2 Certainty Equivalent Approximation

In Figure 29, we illustrate a certainty equivalent approach used to predict the
performance within each AOR. Our design model is used to predict the value during ingress to the AOR
and provides the (approximate) distribution $f(x_1|x_0,u_0)$ over the state $x_1$. Assuming that each
AOR is independent, a certainty equivalent state $\bar{x}_1$ is selected to represent the entire distribution.
In our experiments, we consider the mean and mode of the distribution in each AOR. Given the
certainty equivalent state $\bar{x}_1$, a control $u_1 = \mu(\bar{x}_1)$ is determined using the MMR algorithm.
Given the state $\bar{x}_1$ and the control $u_1$, our design model is used to compute the estimated value of
the second stage in AOR $i$.

The overall performance of the second stage is determined by combining the value of
individual AORs.

$$J[f(x_2|x_0,u_0)] = \sum_{i=1}^{N_{AOR}} J_i[f(x_2|x_0,u_0)]$$

Figure 29 Certainty Equivalent Approximation for AOR Prediction

Using the Surrogate Method as a base controller and the MMR as an AOR controller, this
approach requires $O(N_{types} \cdot N_{destinations} \cdot N_{iterations} \cdot N_{AOR} \cdot MMR_{AOR})$ single stage evaluations
using our design model.


4.6.5.3 Partial Open-loop Approximation

In Figure 30, we illustrate a partial open-loop approach to predict the performance within
each AOR. Our design model is used to predict the value during ingress to the AOR and provides the
(approximate) distribution $f(x_1|x_0,u_0)$ over the state $x_1$. Assuming that each AOR is
independent, we consider the $k$th state $x_1^{AP,k}$ of the “gorilla” air package and adopt a certainty
equivalent state $\bar{x}_1^{Enemy}$ for the targets and threats.

The “gorilla” air package state $x_1^{AP,k}$ and the certainty equivalent state $\bar{x}_1^{Enemy}$ of the targets
and threats are combined to establish the state $x_1$. A control $u_1 = \mu(x_1)$ is determined using the
MMR algorithm. Our design model is used to compute the value in AOR $i$,
$J_i[f(x_2|x_1^{AP,k},\bar{x}_1^{Enemy},u_1,x_0,u_0)]$, as a function of the “gorilla” air package state. The estimated
value of the second stage in AOR $i$ is given by

$$J_i[f(x_2|x_0,u_0)] = \sum_k J_i[f(x_2|x_1^{AP,k},\bar{x}_1^{Enemy},u_1,x_0,u_0)] \cdot P(x_1^{AP,k}|x_0,u_0)$$

The overall performance of the second stage is determined by combining the value of
individual AORs

$$J[f(x_2|x_0,u_0)] = \sum_{i=1}^{N_{AOR}} J_i[f(x_2|x_0,u_0)]$$

Figure 30 Partial Open-Loop Approximation for AOR Prediction

Using the Surrogate Method as a base controller and the MMR as an AOR controller, this
approach requires $O(AOR_{strike} \cdot AOR_{weasel} \cdot MMR_{AOR})$ single stage evaluations using our
design model. $AOR_{strike}$ is the number of strike aircraft per AOR. $AOR_{weasel}$ is the number of
weasel aircraft per AOR.
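
The partial open-loop evaluation and the combination across AORs can be sketched as follows. The data layout (a per-AOR dict with `package_probs` and `enemy_ce` keys), the summation used to combine AORs, and the callables `policy` and `stage_value` are hypothetical; they stand in for the MMR controller and the design model.

```python
def partial_open_loop_value(package_probs, enemy_ce, policy, stage_value):
    """Second-stage value in one AOR.

    Enumerates the "gorilla" air package states with their probabilities while
    holding the targets and threats at a single certainty equivalent state.
    """
    total = 0.0
    for pkg_state, prob in package_probs.items():
        x1 = (pkg_state, enemy_ce)  # combined state at the AOR point
        u1 = policy(x1)             # local tasking decision for this state
        total += prob * stage_value(x1, u1)
    return total


def second_stage_value(aors, policy, stage_value):
    """Combine independent AORs by summing their second-stage values."""
    return sum(
        partial_open_loop_value(a["package_probs"], a["enemy_ce"], policy, stage_value)
        for a in aors
    )
```

Because only the air package states are enumerated, the number of controller runs per AOR grows with the number of possible package states rather than with the full joint state space.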


4.6.6 2-Stage/1-Wave Model with AOR Tasking and ISR Collection

Similar  to  the  previous  AOR  section,  this  controller  allocates  air  packages  to  AORs, 
which are subsequently tasked to specific targets. However, in this case, we consider only partial
information, i.e., not all objects are observed prior to tasking within individual AORs. Similar to
the previous section, this addresses scalability by decomposing the 1-wave problem into
sub-problems, but also begins to address partial observations and imperfect information at the
decision  point. 

Information  dynamics  describe  how  information  evolves  over  time.  These  dynamics  are 
captured  in  the  probabilistic  models  associated  with  each  battlespace  object,  as  well  as 
observation  dynamics.  Up  to  this  point,  all  objects  have  been  perfectly  observed  at  each  decision 
point.  In  this  case,  only  some  objects  will  be  observed,  while  the  probabilistic  state  of  others 
evolves  over  time.  In  the  absence  of  observations  (and  interactions),  the  probabilistic  state  will 
eventually  converge  to  a  steady  state  distribution  which  may  be  computed  directly  from  the 
probabilistic  model  associated  with  the  object.  In  Figure  31,  we  illustrate  the  information 
dynamics associated with a single object. Assuming no previous observations, the initial
information state corresponds to the steady state distribution … = 0.4 (red), $P_{Active}$ = …
(yellow), and … = 0.25. There is no change in the information state until an observation
occurs, at which point we observe the object perfectly, … = 1.0. At that point (or once the
object is no longer observed), information starts to degrade (i.e., we are less certain what state
the object is in). If a decision needs to be made regarding this object, we note that there is less
ambiguity and thus more information during or immediately following an observation. We
expect this behavior to affect the types of decisions that the base controller makes. Namely,
aircraft should be sent to AORs that have more information (less ambiguity regarding the
state of the associated objects).

Figure 31 Information Dynamics

In  this  case,  the  AOR  control  problem  is  the  same  as  in  the  previous  section,  except  that 
the  information  available  at  the  AOR  points  will  be  based  on  previous  partial  observations. 
Depending  on  the  amount  of  time  that  has  passed  since  the  last  observation,  information 
regarding targets and air defenses will be degraded. Knowing where and when observations will
occur,  we  expect  that  the  base  controller  will  send  aircraft  to  the  AORs  that  will  have  better 
information  at  the  associated  decision  point.  As  in  the  previous  case,  we  neglect  the  coupling 
among  AORs  and  that  due  to  threats  during  egress.  The  base  controller  uses  one  of  the 
combinatorial  algorithms  to  assign  aircraft  to  AOR  points,  and  the  predictor  is  modified  to 
account  for  the  subsequent  tasking  of  aircraft  to  targets  when  they  arrive  at  an  AOR.  This  is 
illustrated  in  Figure  32.  Our  baseline  predictor  is  used  to  determine  the  value  during  ingress  to 
the AOR points, but also determines the distribution $f(I_1|I_0,u_0)$ from which the observations
$I_1 = \{z_1, z_2, \ldots\}$ are drawn. Observations that occur during ingress are projected forward (i.e., the
information is degraded) to the decision point so that a current estimate $\hat{x}_1 = P(x_1|I_1)$ of the state
may be used to determine the control decision $u_1 = \mu(\hat{x}_1)$ in the AOR. Given the control
decision $u_1$ and the estimate of the current state $\hat{x}_1$, we evaluate the second stage in each AOR,
$J_i[f(x_2|I_1,u_1,I_0,u_0)]$. We use the same methods that were used in the previous section to
estimate the second stage performance, except that the random sampling and certainty equivalent
approximations are taken in terms of the observations rather than the current state. Since the
partial open-loop approximation enumerates the air package state $x_1^{AP,k}$, which we assume to be
perfectly observable, only certainty equivalent is considered in terms of the observations.
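
The information dynamics described in this section, in which a perfect observation degrades toward a steady state distribution as it is projected forward, can be illustrated with a small Markov chain sketch. The two-state chain and its transition probabilities are invented for illustration and are not taken from the report's object models.

```python
import math


def propagate(belief, transition, steps):
    """Project a probabilistic state forward with no further observations.

    belief     : list of state probabilities at the time of the last observation
    transition : row-stochastic matrix, transition[i][j] = P(next = j | now = i)
    """
    n = len(belief)
    for _ in range(steps):
        belief = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return belief


def entropy(belief):
    """Shannon entropy in nats; larger means more ambiguity about the state."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)
```

Starting from a perfectly observed state such as `[1.0, 0.0]`, repeated calls to `propagate` increase the entropy and move the belief toward the chain's stationary distribution, which is the behavior the base controller is expected to exploit when choosing which AORs to task.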


Figure 32 AOR Prediction with Partial Information

4.6.6.1 Random Sampling

In Figure 33, we illustrate the random sampling approach used to predict the performance
within each AOR. Our design model is used to predict the value during ingress to the AOR and provides
the distribution $f(I_1|I_0,u_0)$ from which the observations $I_1 = \{z_1, z_2, \ldots\}$ are drawn. The
observations are selected randomly according to the known distribution $f(I_1|I_0,u_0)$ and
projected forward to the decision point, providing the current state estimate $\hat{x}_1 = P(x_1|I_1)$. Based
on the state estimate $\hat{x}_1$, a control $u_1 = \mu_{MMR}(\hat{x}_1)$ is determined for each AOR. Given the state
estimate $\hat{x}_1$ and the control $u_1$, our design model is used to compute the value in each AOR,
$J_i[f(x_2|I_1,u_1,I_0,u_0)]$, given the intermediate state estimate $\hat{x}_1$, from which we estimate the second
stage performance in AOR $i$.

$$J_i[f(x_2|I_0,u_0)] \approx \frac{1}{N_{samples}} \sum_{n=1}^{N_{samples}} J_i[f(x_2|I_1^{(n)},u_1^{(n)},I_0,u_0)]$$

The overall performance of the second stage is determined by combining the value of the
individual AORs.

$$J[f(x_2|I_0,u_0)] = \sum_{i=1}^{N_{AOR}} J_i[f(x_2|I_0,u_0)]$$

Figure 33 Random Sampling for AOR Prediction with ISR Uncertainty

Using the Surrogate Method as a base controller and the MMR as an AOR controller, this
approach requires $O(N_{types} \cdot N_{destinations} \cdot N_{iterations} \cdot N_{samples} \cdot N_{AOR} \cdot MMR_{AOR})$ single stage
evaluations using our design model.


4.6.6.2 Certainty Equivalent Approximation

In Figure 34, we illustrate the certainty equivalent approach used to predict the
performance within each AOR. Our design model is used to predict the value during ingress to the AOR
and provides the distribution $f(I_1|I_0,u_0)$ from which the observations $I_1 = \{z_1, z_2, \ldots\}$ are drawn.
The certainty equivalent observation $\bar{I}_1$ is selected to represent the entire distribution and
projected forward to the decision point, providing the current state estimate $\hat{x}_1 = P(x_1|\bar{I}_1)$. Based
on the state estimate $\hat{x}_1$, a control $u_1 = \mu(\hat{x}_1)$ is determined for each AOR. Given the state
estimate $\hat{x}_1$ and the control $u_1$, our design model is used to compute the value in each AOR $i$.

The overall performance of the second stage is determined by combining the value of the
individual AORs.

$$J[f(x_2|I_0,u_0)] = \sum_{i=1}^{N_{AOR}} J_i[f(x_2|I_0,u_0)]$$

Figure 34 Certainty Equivalent Approximation for AOR Prediction with ISR Uncertainty

Using the Surrogate Method as a base controller and the MMR as an AOR controller, this
approach requires $O(N_{types} \cdot N_{destinations} \cdot N_{iterations} \cdot N_{AOR} \cdot MMR_{AOR})$ single stage evaluations
using our design model.

4.6.6.3 Partial Open-loop Approximation

In Figure 35, we illustrate a partial open-loop approach to predict the performance within
each AOR. Our design model is used to predict the value ingress to the AOR and provides the
(approximate) distribution $f(I_1 \mid I_0, u_0)$ from which the observations
$I_1 = \{z_1, z_2, \ldots\}$ are drawn. We consider the $k$th state of the “gorilla” air package
and adopt a certainty equivalent observation for the targets and threats.

The “gorilla” air package state and the certainty equivalent observation of the targets and
threats are combined to estimate the current state $x_1$. Based on the state estimate $x_1$, a
control $u_1 = \mu(x_1)$ is determined for each AOR. Given the state estimate $x_1$ and the
control $u_1$, our design model is used to compute a value in AOR $i$ as a function of the
“gorilla” air package state. The estimated value of the second stage in each AOR $i$ is given by

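$$J\left[f(x_2^i \mid I_0, u_0)\right] = \sum_k J\left[f(x_2^i \mid k, I_0, u_0)\right] \cdot P(k \mid I_0, u_0)$$

The overall performance of the second stage value is determined by combining the value
of individual AORs

$$J\left[f(x_2 \mid I_0, u_0)\right] = \sum_{i=1}^{N_{AOR}} J\left[f(x_2^i \mid I_0, u_0)\right]$$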


Figure 35 Partial Open-loop Approximation for AOR Prediction with ISR Uncertainty

Using the Surrogate Method as a base controller and MMR as an AOR controller, this
approach  requires  0{A0Rs„i^^  ■  AOR^^„^^,  ■  '^^aor)  single  stage  evaluations  using  our 

design  model.

4.7  JAO  SCALABILITY  ASSESSMENT

As presented in Section 4.2, the goal of this research was to develop ADP algorithms that
produce operationally consistent behaviors for realistically sized JAO scenarios. One immediate
complexity reduction was achieved by adopting a hybrid, multi-rate control architecture that
tailors the application of control, i.e. reactive or proactive, for the battlespace situation at hand;
however, additional complexity reduction was required. Thus, the research was focused on
developing fast and efficient combinatorial assignment and prediction models that form the
foundation of the ADP algorithm. To this end, it was the goal of this research to develop a
spectrum of ADP control strategies that can be mixed and matched to provide a broad range of
performance and computational complexity. This was achieved. In the previous four sections,
the details of the efficient combinatorial assignment algorithms along with the fast analytic
prediction models were presented. Accordingly, Figure 36 summarizes the ADP algorithms that
were developed and implemented as part of this research. Again, depending on the battlespace
situation, different assignment algorithms and prediction models can be combined to produce a
tailored application of control.


Control Space Assignment Algorithms

•  Retasker using Combinatorial Rollout
•  Aborter using Combinatorial Rollout
•  Target Tasking using MMR
•  AOR Tasking using MMR
•  Target Tasking using Surrogate Method
•  AOR Tasking using Surrogate Method
•  Target Tasking using Combinatorial Rollout

Future Value Prediction Algorithms

•  1-Stage/1-Wave
•  2-Stage/1-Wave with Retasking
•  2-Stage/2-Wave
•  4-Stage/2-Wave with Retasking
•  2-Stage/1-Wave with AOR Tasking
•  2-Stage/1-Wave with AOR Tasking & ISR


Figure 36 ADP Algorithms Developed and Implemented in Hybrid, Multi-Rate Control Architecture

Having developed and implemented these algorithms, the question remains: do these
ADP algorithms produce proactive, operationally consistent behaviors in real-time or near
real-time for realistically sized JAO scenarios? The behavioral part of this question is the topic
of the next chapter. Here we present the scalability assessment for the different ADP combinations
implemented for generating a base mission queue without AOR tasking. Figure 37 presents the
computation complexity of the base mission controllers, which represents an upper bound on the
computation complexity of the ADP implemented in ALPHATECH’s BMC3 Development
Environment. Note that the shaded area represents control solutions that require more than 15
minutes to compute. Furthermore, a 100-target scenario corresponds to approximately 30 target
locations.

[Figure 37 legend. Assignment algorithms: Maximum Marginal Return (MMR); Surrogate Seeded
with MMR; Surrogate with Restarts; Combinatorial Rollout. Prediction models: 1-Wave/1-Stage
Markov Chain; 1-Wave/2-Stage Markov Chain with Retask; 2-Wave/2-Stage Markov Chain;
2-Wave/4-Stage Markov Chain with Retask; 2-Wave/4-Stage Simulation-in-the-Loop.]

Figure 37 Scalability Assessment of ADP Algorithms Developed and Implemented for Hybrid,
Multi-Rate Control Architecture (Single CPU)

The assumptions for this assessment are as follows:

•  Base Mission Controller Generating Mission Queue
•  Base Resources: 24 Strike and 12 Weasel Aircraft
•  Maximum Package (6 Strike, 2 Weasel)
•  3.57 DMPIs per Target Location
•  MMR for Future Base Loop Closures
•  CE Analytical Predictors
•  Performance on 850 MHz CPU
•  Distributed on 125 CPU Array

It is seen from Figure 37 that there is truly a spectrum of solution approaches in terms of
computation complexity. Figure 38 illustrates the computation complexity if it is assumed that the
AOC has distributed computational capability.


[Figure 38 legend. Assignment algorithms: Maximum Marginal Return (MMR); Surrogate Seeded
with MMR; Surrogate with Restarts; Combinatorial Rollout. Prediction models: 1-Wave/1-Stage
Markov Chain; 1-Wave/2-Stage Markov Chain with Retask; 2-Wave/2-Stage Markov Chain;
2-Wave/4-Stage Markov Chain with Retask; 2-Wave/4-Stage Simulation-in-the-Loop.
Horizontal axis: Number of Target Locations (ticks at 10 and 100).]

Figure 38 Scalability Assessment of ADP Algorithms Developed and Implemented for Hybrid,
Multi-Rate Control Architecture (125 CPUs)

Thus, it is seen from the two figures above that there is a variety of base mission
controllers that can generate mission queues in near real-time for realistically sized JAO
scenarios. In particular, the distributed MMR with analytical 4-stage/2-wave predictors provides
near real-time computation performance for scenarios with approximately 60 target locations.
Likewise, the distributed Surrogate Method with analytical 2-stage/2-wave predictors provides
near real-time computation performance for scenarios with approximately 60 target locations.

In summary, this scalability assessment illustrated that real-time or near real-time
computational performance is achievable for most of the ADP algorithms developed and
implemented as part of this research. Furthermore, this scalability assessment indicated that
many of the ADP controllers could provide near real-time performance for scenarios with 250
targets if some modest parallel computation capability were available in an AOC.

5  EXPERIMENTATION  RESULTS

In this section, experimental results that illustrate proactive, operationally consistent
control strategies for the ADP algorithms developed in the previous section will be presented.
The primary emphasis of this experimentation is to demonstrate the benefits of proactive versus
reactive control strategies for relevant JAO problems. To this end, control behaviors and
empirical simulation results will be presented to highlight the differences between the two
control paradigms. All experimental results were generated using ALPHATECH’s BMC
Development Environment.

The presentation of these experimental results is organized such that we begin with
simple JAO problems and then build upon them in terms of both scenario complexity and
controller complexity.

5.1  DEMONSTRATION  SCENARIO 

The  scenario  used  for  this  assessment  is  based  on  the  Cyberland  Scenario  provided  by  the 
DARPA/JFACC  Program  Office.  For  this  scenario,  it  has  been  assumed  that  a  higher  level  of 
decomposition,  both  spatial  and  temporal,  of  the  JFACC  objectives  has  been  performed.  The 
enclosed  scenario  represents  a  24-hour  segment  of  the  air  campaign  and  the  Area  of 
Responsibility  (AOR)  is  the  Northern  Air  Defense  (AD)  District  in  West  Cyberland.  Key 
features  of  this  scenario  include  approximately  100  targets  and  threats  and  36  air  vehicle  assets 
of  which  there  are  24  generic  strike  aircraft  and  12  generic  weasel  aircraft.  As  noted  in  Section 
3,  targets  may  be  known,  unknown  (hidden),  time  critical,  or  destroyed.  Threats  may  be  active, 
inactive,  unknown,  repairing,  or  destroyed.  There  is  also  uncertainty  in  the  interaction  between 
enemy  and  friendly  assets  that  depends  on  the  characteristics  of  the  target  or  threat  and  the 
composition  of  the  air  package  involved,  i.e.  the  number  of  strike  and  weasel  aircraft,  and  also  on 
the  geometry  of  the  interaction. 

Figure 39 illustrates the target/threat laydown used for this experimental evaluation. It is
seen from this figure that there are 24 normal target locations, each represented by a map symbol.
A normal target is a generic target that is known for all time and has 4 Designated Mean Points of
Impact (DMPIs). The scenario also contains 4 TCT regions, each containing 2 emerge locations.
In this context, TCTs are static, and are characterized by a single DMPI per emerge location.
The TCTs emerge and hide based on stochastic processes; through the combination of temporal
and event-based stochastic transitions, the TCTs are on average only vulnerable for
approximately 15 minutes in this scenario. When the TCT is hiding, it is represented by ♦, and
a different symbol is used when it is vulnerable. It is also seen from this figure that the scenario
contains 13 SAM sites of varying size. The position of each SAM site is marked by a symbol,
and the ring around the SAM represents its lethal range. The color of this ring represents the
status of the SAM site; the color scheme is as follows: red represents radar on, yellow
represents radar off, green represents under repair, and black represents unknown.

Figure 39 Demonstration Scenario Used for Experimental Evaluation

In terms of the friendly assets, there is an aircraft carrier positioned off the northern coast,
which contains half of a squadron of generic strike and weasel aircraft available for this JFACC
objective. There is a total of 24 strike aircraft and 12 weasel aircraft. As mentioned in the
previous section, the controller composes and tasks air packages to target locations. In this
scenario, air packages and their missions are each drawn with their own map symbols. It is
assumed in this scenario that the maximum air package size is 6 strike and 2 weasel aircraft. It is
also assumed that aircraft are assigned to air packages in increments of 2. Given the air package
composition and tasking, the controller has the capacity to define a risk avoidance route. For some
experiments, this route is determined a priori, and in others, the route selection is part of the
control space. Finally, it is assumed in this scenario that air vehicles do not have adequate range
to circumvent this defense posture.
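
The package constraints above define a small, enumerable set of compositions. The following Python sketch lists them under two assumptions that are not stated in the report: that a package may contain zero aircraft of one type, and that the empty package is excluded. The function name `feasible_packages` is illustrative only.

```python
from itertools import product

MAX_STRIKE = 6   # maximum strike aircraft per package (from the scenario)
MAX_WEASEL = 2   # maximum weasel aircraft per package (from the scenario)
INCREMENT = 2    # aircraft are assigned in increments of 2 (from the scenario)


def feasible_packages(max_strike=MAX_STRIKE, max_weasel=MAX_WEASEL, step=INCREMENT):
    """Return all (strike, weasel) compositions, excluding the empty package.

    Assumption: single-type packages such as (2, 0) or (0, 2) are allowed.
    """
    strike_options = range(0, max_strike + 1, step)
    weasel_options = range(0, max_weasel + 1, step)
    return [(s, w) for s, w in product(strike_options, weasel_options) if s + w > 0]


packages = feasible_packages()
```

Under these assumptions there are seven compositions per tasked target, ranging from two-aircraft packages up to the full (6 strike, 2 weasel) package, which indicates how the per-target control space stays small even though the joint assignment problem is combinatorial.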

Finally, as noted in Section 4, a relative valuation scheme is required to distinguish
control options. For this scenario, all normal targets are valued at 40 points and TCTs are valued
at 400 points. In terms of the airborne assets, both strike and weasel aircraft are valued at 40
points each. Note that SAM sites have no explicit value; however, they clearly have implicit
value given that they affect the attrition of aircraft. Given this valuation scheme, the performance
metric used for the control optimization is the sum of the target value destroyed minus the
value of the aircraft lost.
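
As a minimal illustration of this metric, the sketch below scores a single outcome using the point values stated above. It assumes that value is credited per destroyed target rather than per DMPI, which is not specified here; the function and argument names are hypothetical.

```python
NORMAL_TARGET_VALUE = 40   # points per normal target (from the scenario)
TCT_VALUE = 400            # points per time critical target (from the scenario)
AIRCRAFT_VALUE = 40        # points per strike or weasel aircraft (from the scenario)


def performance(normal_destroyed, tcts_destroyed, aircraft_lost):
    """Target value destroyed minus the value of friendly aircraft lost."""
    target_value = (NORMAL_TARGET_VALUE * normal_destroyed
                    + TCT_VALUE * tcts_destroyed)
    return target_value - AIRCRAFT_VALUE * aircraft_lost


# Example outcome: 10 normal targets and 1 TCT destroyed, 3 aircraft lost.
# 10*40 + 1*400 - 3*40 = 680
score = performance(normal_destroyed=10, tcts_destroyed=1, aircraft_lost=3)
```

With these values a single TCT is worth as much as ten normal targets or ten aircraft, which is why anticipating TCT emergence weighs so heavily in the control decisions that follow.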

In summary, Figure 39 illustrates the baseline scenario that was used for a series of
experiments that illustrate proactive, near real-time control performance of the controllers
presented in Section 4. Again, the presentation of these experimental results is organized such
that we begin with simple JAO problems and then build upon them in terms of both scenario
complexity and controller complexity. As the complexity increases, minor changes to the baseline
scenario were required and will be called out in the appropriate sections.

5.2  1-STAGE/1-WAVE  PROBLEM

The first set of experiments was performed to assess the accuracy of the analytic
predictors over a set of control decisions. The details of this analytic prediction model are
contained in Section 4.6.1. To perform this assessment, a one-wave problem was set up where
the initial mission queue was defined a priori and no loop closures were permitted during the
execution of the wave, i.e. no retasks or aborts. In this situation, the wave begins when all air
packages are launched at time t=0, and the wave ends when all air packages return to base.
Results will be presented for the cases when the initial status of the enemy assets is either known
or is specified by a distribution. Finally, straight line routing was used for all air packages since
this condition results in higher attrition, and thus will highlight the prediction accuracy of the
baseline analytic prediction model presented in Section 4.6.1.

The first experiment that was run to make this assessment was for the case when
the initial status of the enemy assets was known a priori. This situation is illustrated in Figure
40 where the initial state is known at t=0, and the battlespace state is propagated to the wave
return-to-base. Using the prediction model presented in Section 4.6.1, the estimated
performance of the air package assignment was obtained and compared to simulation runs. With
10,000 Monte Carlo performance runs, the performance prediction using the 1-stage/1-wave
prediction model was statistically equivalent to the simulated performance.

Figure 40 Battlespace State Propagation Diagram for Known Initial State

Given the experimentally demonstrated accuracy of the 1-stage prediction model given
a known initial state, the next step was to assess the accuracy of the prediction model for
an uncertain initial state of the enemy. As discussed in Section 4.6, many of the multiple
state prediction models require the propagation of the battlespace state given uncertain enemy
status. As in the previous evaluation, the air package composition and tasking is specified as
part of the initial state $x_0$. Figure 41 illustrates the 1-stage prediction problem for the case
where the initial state is defined by a distribution. Given the air packages and enemy status
distributions at t=0, the battlespace state was propagated to the return-to-base without any
retasking or aborts. Based on the return-to-base distribution obtained from the prediction model,
the expected performance was computed. For the empirical evaluation, 2,000,000 Monte Carlo
evaluations were obtained by first generating 200 realizations of enemy status at t=0 and then,
for each realization, obtaining 10,000 samples of the expected performance. Based on these
evaluations, it was determined that the 1-stage prediction model was statistically optimistic by 6%.

Figure 41 Battlespace State Propagation Diagram for Unknown Initial State
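
The two-level sampling scheme used for the empirical evaluation (realizations of the initial enemy status, then outcome samples for each realization) can be sketched as follows. This is only an illustration: the two `toy_*` functions are hypothetical stand-ins for the simulation, their numbers are invented, and the inner sample count is reduced from the 10,000 used in the evaluation so that the sketch runs quickly.

```python
import random
import statistics


def nested_monte_carlo(sample_initial_state, sample_performance,
                       n_outer, n_inner, rng):
    """Average performance over random initial states and random outcomes.

    Outer loop: draw a realization of the uncertain initial enemy status.
    Inner loop: draw performance outcomes conditioned on that realization.
    """
    per_state_means = []
    for _ in range(n_outer):
        state = sample_initial_state(rng)
        outcomes = [sample_performance(state, rng) for _ in range(n_inner)]
        per_state_means.append(statistics.fmean(outcomes))
    return statistics.fmean(per_state_means), n_outer * n_inner


# Hypothetical stand-ins for the simulation (NOT the report's models).
def toy_initial_state(rng):
    # Number of active threat sites out of 13, each active with probability 0.5.
    return sum(rng.random() < 0.5 for _ in range(13))


def toy_performance(active_sites, rng):
    # Score falls with the number of active sites, plus outcome noise.
    return 400.0 - 20.0 * active_sites + rng.gauss(0.0, 30.0)


rng = random.Random(0)
# 200 outer realizations as in the evaluation; inner count reduced for speed.
estimate, total_evaluations = nested_monte_carlo(
    toy_initial_state, toy_performance, n_outer=200, n_inner=100, rng=rng)
```

The total number of evaluations is the product of the two sample counts, which is how 200 realizations with 10,000 samples each give the 2,000,000 evaluations quoted above.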


5.3  2-STAGE/1-WAVE  WITH  RETASKING  PROBLEM

The next set of experiments was performed to assess the performance of the 2-Stage/1-Wave
Retasking model in terms of both prediction quality and control solution quality. The
details of this analytic prediction model are contained in Section 4.6.2. For this 1-wave problem,
all air packages launch at t=0, and when a TCT emerges, the retasker controller outlined in
Section 4.5.1 is called to divert ingress air packages to the TCT. As before, the wave ends when
all air packages return to base. As noted in Section 4.6.2, the difficulty of modeling the future
retasking is that the loop closure depends on the TCT emerge event, which is stochastic, and that
multiple TCT emerge events can occur during a single wave. The results of this assessment are
presented below.

To perform the prediction accuracy assessment, the one-wave problem was set up where
the initial mission queue was defined a priori and this mission queue was executed. Figure 42
illustrates the 2-Stage/1-Wave retasking prediction problem where the initial state includes the
initial mission queue and the enemy status at t=0. Results will be presented for the cases when
the initial status of the enemy assets is either known or is specified by a distribution. To perform
this assessment, the analytic model with retasking was compared to experimental evaluation and
to the prediction model that does not model the retask loop closure, i.e. reactive control strategy.

Figure 42 Battlespace State Propagation for 2-Stage/1-Wave Retasking Problem

Figure 43 illustrates the prediction accuracy of the two analytic prediction models compared
to empirical evaluation. For the empirical evaluation, 10,000 Monte Carlo evaluations were
performed to generate the empirical performance prediction. It is seen from the figure that the
analytic model that accounts for the retasking loop closure is within 11% of the true expected
performance. Furthermore, the analytic model that does not account for the retasking loop
closure is only within 50% of the true expected performance.

Figure 43 Prediction Accuracy of Different Design Models for the 2-Stage/1-Wave Retasking
Problem with Known Initial State $x_0$ (bars: 1-Stage, 2-Stage, Empirical; annotated 99.5%
confidence)

Next, the prediction accuracy assessment was performed for the situation where the initial
status of the enemy is uncertain and is specified by distributions. Given the air packages
and enemy status distributions at t=0, the battlespace state was propagated to the return-to-base
with retask loop closures for TCT emerge events. As for the known enemy status case,
results were obtained from the 2-Stage/1-Wave Retasking Predictor, 1-Stage/1-Wave Predictor,
and experimental evaluation. For the empirical evaluation, 33,500 Monte Carlo evaluations were
obtained by first generating 335 realizations of enemy status at t=0 and then for each sample,
obtaining 1,000 samples of the expected performance. Figure 44 illustrates the prediction
accuracy of the two analytic prediction models compared to the empirical results. It is seen from
the figure that the analytic model that accounts for the retasking loop closure is statistically
equivalent to the true expected performance. Furthermore, the analytic model that does not
account for the retasking loop closure provides a poor prediction and is only 6% of the true
expected performance. Thus, it is seen from this and the previous assessment that the
2-stage/1-wave retasking analytic prediction model does provide a good approximation to the true
expected performance. Furthermore, not unexpectedly, the 1-Stage/1-Wave analytic prediction
model does not provide a good approximation to the true expectation.

Figure 44 Prediction Accuracy of Different Design Models for the 2-Stage/1-Wave Retasking
Problem with Unknown Initial State $x_0$
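
The report does not spell out the statistical test behind "statistically equivalent." One common reading, sketched below as an assumption, is that the analytic prediction falls inside a confidence interval around the Monte Carlo sample mean; the 99.5% level is taken from the annotation on Figure 43, and the normal approximation, function name, and synthetic samples are all illustrative.

```python
import random
import statistics


def within_confidence_interval(prediction, samples, confidence=0.995):
    """Check whether a prediction lies inside a normal-approximation
    confidence interval around the Monte Carlo sample mean."""
    n = len(samples)
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / n ** 0.5
    # Two-sided critical value for the requested confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * std_err
    inside = abs(prediction - mean) <= half_width
    return inside, (mean - half_width, mean + half_width)


# Synthetic samples standing in for 10,000 simulated performance values.
rng = random.Random(1)
samples = [rng.gauss(100.0, 20.0) for _ in range(10_000)]

# A prediction far from the sample mean is flagged as not equivalent.
equivalent, interval = within_confidence_interval(150.0, samples)
```

Because the interval half-width shrinks with the square root of the sample count, a predictor that is off by a fixed percentage will eventually be flagged as biased once enough Monte Carlo runs are available, which is consistent with a small bias such as 6% being detectable with very large sample sizes.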

Having presented the prediction accuracy, the question remains whether the higher
fidelity, i.e. proactive, prediction model that anticipates the future retasking loop closures
improves the base mission control solution. To perform this assessment, two mission queues
were generated: one using the 2-stage analytic predictor that anticipates the retasking loop
closure and one using the 1-stage analytic predictor that neglects the retasking loop closure. In
both cases, the Surrogate Method, which was presented in Section 4.3.3, was used to search the
control space, and the initial status of the enemy and thus the initial state $x_0$ is known.
Figure 45 illustrates the two mission queues produced.

Figure 45 Behavioral Comparison of Proactive Versus Reactive Control Strategy for
2-Stage/1-Wave Retasking Problem

It is seen from this figure that the mission queue generated using the reactive prediction
model assigns air packages to TCT locations. In comparison, the mission queue generated using
the proactive prediction model does not assign air packages to any TCT locations, but instead
strategically locates air packages such that each TCT has a minimum of two retask options. The
significance of this differing mission assignment is that the 2-stage model anticipates the fact that
air packages in the vicinity of the TCT regions can be retasked in the event that a TCT emerges;
if no TCT emerges, the air packages strike normal targets. Thus, for the 1-stage mission queue,
air packages fly to TCT locations, and if no TCT emerges, the air package cannot achieve any
positive value. On the other hand, for the 2-stage solution, air packages fly to normal targets in
the vicinity of TCTs. If a TCT emerges during ingress, a retasking solution exists since an air
package will be in range; if no TCT emerges, the air packages proceed to their normal targets
and achieve positive value. Thus, by anticipating the TCT emerge event and the subsequent loop
closure, resources can be more effectively tasked.
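
The argument above can be made concrete with a deliberately simplified expected-value comparison. The sketch uses the scenario's point values (400 for a TCT, 40 for a normal target) but ignores attrition, kill probabilities, and timing, and the emergence probability `p` is a hypothetical number chosen only for illustration.

```python
def direct_tct_tasking(p_emerge, tct_value=400.0):
    """Expected value of sending a package straight to a TCT emerge location."""
    # Value is earned only if the TCT actually emerges.
    return p_emerge * tct_value


def nearby_target_with_retask(p_emerge, tct_value=400.0, normal_value=40.0):
    """Expected value of tasking a nearby normal target with a retask option."""
    # Retask to the TCT if it emerges; otherwise strike the normal target.
    return p_emerge * tct_value + (1.0 - p_emerge) * normal_value


p = 0.25  # hypothetical probability that a TCT emerges during ingress
reactive_value = direct_tct_tasking(p)
proactive_value = nearby_target_with_retask(p)
```

In this toy model the retask option is never worse: the two strategies differ by `(1 - p) * 40`, the value of the normal target that is struck whenever no TCT emerges.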

Having illustrated the behavioral differences between the base mission solutions, the
performance numbers will now be presented. Figure 46 illustrates the predicted performance and
the empirical performance for the two mission queues. Note that empirical results were obtained
using 10,000 Monte Carlo samples. It is seen from this figure that the 2-stage control solution
exhibits a statistically significant performance improvement over the 1-stage control solution.
Furthermore, as expected, the predicted performance of the two solutions is pessimistic. In the
case of the 1-stage solution, the predicted performance does not account for any retasks, whereas
the empirical solution does permit reactive retasking to a TCT emerge event. On the other hand,
the lower predicted performance of the 2-stage model was identified in previous paragraphs to be
related to the approximations relative to the uncertain loop closure time.

Figure 46 Control Performance for the 2-Stage/1-Wave Retasking Problem

The final piece to the performance story is to compare the proactive versus reactive
controller complexity for this particular scenario. Figure 47 illustrates the number of 1-stage
prediction function calls required to produce a control solution for the different control
approaches. Note that, as highlighted in Section 4.6.3, a single stage prediction model may
require hundreds of one-stage prediction function calls to produce the estimated performance for
a control option. It is seen from this figure that the 2-stage prediction algorithm requires only
about 0.5 orders of magnitude (roughly a factor of 3) more function calls. In terms of clock time,
the reactive, 1-stage algorithm computes a base mission queue in ~1 minute, whereas the
proactive, 2-stage algorithm computes a higher quality mission queue in ~5 minutes.

Figure 47 Controller Computational Performance for the 2-Stage/1-Wave Retasking Problem
(bars: 1-Stage, 2-Stage)

To summarize these results, the 2-Stage/1-Wave with Retasking problem illustrates the
benefits of anticipating future TCT emerge events and subsequent retasking controller loop
closures. It was shown that the analytic prediction model that approximates the future loop
closure does provide an accurate prediction for a given control option, both when the initial
enemy status is deterministic and when it is defined by distributions. Furthermore, it was shown
that using the analytic prediction model that anticipates the retasking loop closure produces
proactive control solutions that strategically position air packages near TCT regions. Finally,
experimental results show that the anticipatory approach provides a statistically significant
performance improvement over a control solution that only reacts to the TCT emerge event.


5.4  2-STAGE/2-WAVE  PROBLEM 

In the previous section, the benefits of proactive versus reactive control were illustrated
for the case when the prediction horizon was 1-wave but included the potential for retasking loop
closures. In this assessment, the complexity of the problem is increased by extending the
prediction horizon to include the second wave, without the potential for retasking loop closures.
Thus, this experiment will highlight the benefits of performing proactive control over a 2-wave
problem. Accordingly, the 2-Stage/2-Wave prediction model presented in Section 4.6.3 will be
used for this experiment.

To show this benefit, the scenario was constructed such that all normal targets have
terminal time constraints, i.e. expiration deadlines, on their value, and these constraints were
established such that they become active during the second wave execution. Additionally,
routing risk was added to the control space; in this context, the controller could choose either a
high or low risk route for the entire wave. The combination of target value deadlines and routing
provides a design trade of managing risk and execution time, through either target tasking or
route selection, to maximize the two-wave performance. Thus, for this two-wave problem, the
controller determines the initial mission queue and routing option using a two-wave prediction
model. Then, the air packages launch at t=0 and ingress to their respective targets, deploy their
munitions, and egress to base. When all air packages return to base, the controller then
determines the second wave mission queue and routing option using a 1-wave prediction model.
Again the air packages launch from base, ingress to their respective targets, deploy their
munitions, and return to base. The second wave, and with it the experiment, is complete when
all air packages return to base. For this experiment, all mission queues were generated using the
Maximum Marginal Return assignment algorithm discussed in Section 4.5.3.

As  noted  in  Section  4.6.3,  the  difficulty  of  a  2-wave  prediction  is  modeling  the  loop 
closure  and  subsequent  mission  generation  at  the  end  of  the  first  wave.  This  is  a  difficult 
problem  because  there  is  a  combinatorially  large  number  of  possible  states  at  the  end  of  the  first 
wave.  As  a  further  complication,  all  of  the  assignment  algorithms  outlined  in  Section  4.3 
produce  a  discrete  mission  queue  for  a  discrete  number  of  resources;  thus,  there  is  no  control 
algorithm  that  maps  a  surviving  aircraft  distribution  to  a  mission  queue  distribution.  Given  these 
complexities,  a  variety  of  2-stage/2-wave  prediction  models  will  be  assessed  in  this  section.  As 
in  the  previous  sections,  an  assessment  of  the  prediction  quality  and  control  performance  will  be 
made for each of these approaches.

To perform the prediction accuracy assessment, the two-wave problem was set up where
the initial mission queue was defined a priori; thus, the only loop closure occurs when the first
wave ends and the base controller determines the second wave mission queue. Figure 48
illustrates the 2-stage/2-wave prediction problem where the initial state includes the initial
mission queue and the enemy status at t=0. The battlespace state is propagated to the 1st wave
return to base, and based on some approximation, the 2nd wave mission queue is modeled. Given
the 2nd wave mission queue and the distribution over the enemy status, the battlespace state is
propagated to the end of the 2nd wave. Based on the distribution of the battlespace at the end of
the second wave, the performance metric is computed.

Figure 48 Battlespace State Propagation for 2-Stage/2-Wave Problem (stages labeled 0th Stage,
1st Stage, 2nd Stage)

To perform this assessment, the 2-wave prediction models presented in Section 4.6.3, which
include random sampling, certainty equivalence, and aggregated approximation, are compared to
experimentally generated expected performance. Figure 49 illustrates the prediction accuracy of
the different two-wave prediction models compared to empirical evaluation. For the empirical
evaluation, 10,000 Monte Carlo evaluations were performed to generate the empirical performance
prediction. It is seen from this figure that the two random sampling approaches are statistically
equivalent to the empirical expected performance. Here, RS # denotes random sampling using #
realizations of the battlespace state at the end of the first wave; for each realization, the
mission controller is called to produce the second wave mission queue. It is also seen from this
figure that the two certainty equivalence approaches are within 21% of the true expected
performance. Here, CE Mean/Mode denotes certainty equivalence using the mean/mode of the
battlespace state at the end of the first wave to generate a single 2nd wave mission queue.
Finally, it is seen from this figure that the aggregate approximation approaches are within 64%
of the true expected performance. Here, AG Num denotes the aggregation approach that reduces the
number of air packages that are launched, and AG Size denotes the approach that reduces the size
of the air packages launched. Thus, a spectrum of 2-wave prediction models with varying accuracy,
randomness, and computational complexity exists. We now direct our attention to how well these
models support the proactive control decision being made during the first wave mission
generation.

Figure 49 Prediction Accuracy of Different Design Models for the 2-Stage/2-Wave Problem
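The difference between the random sampling and certainty equivalence predictors can be sketched on a toy problem. The following Python sketch is illustrative only and is not the report's prediction model: it assumes the end-of-first-wave state is just a binomially distributed survivor count, and `second_wave_value` is a hypothetical stand-in for calling the mission controller and scoring the resulting second wave queue. All names and parameter values are assumptions.

```python
import math
import random


def second_wave_value(survivors, package_size=4, n_targets=3, target_value=10.0):
    """Hypothetical stand-in for: call the mission controller, then score its queue."""
    packages = min(survivors // package_size, n_targets)
    return packages * target_value


def exact_expected_value(n_aircraft, p_survive):
    """Reference value: sum over the full binomial distribution of survivors."""
    total = 0.0
    for s in range(n_aircraft + 1):
        prob = math.comb(n_aircraft, s) * p_survive**s * (1 - p_survive) ** (n_aircraft - s)
        total += prob * second_wave_value(s)
    return total


def random_sampling_estimate(n_aircraft, p_survive, realizations, rng):
    """RS #: average the second-wave value over sampled end-of-first-wave states."""
    total = 0.0
    for _ in range(realizations):
        survivors = sum(rng.random() < p_survive for _ in range(n_aircraft))
        total += second_wave_value(survivors)
    return total / realizations


def certainty_equivalence_mean(n_aircraft, p_survive):
    """CE Mean: evaluate a single queue at the (rounded) mean survivor count."""
    return second_wave_value(round(n_aircraft * p_survive))


def certainty_equivalence_mode(n_aircraft, p_survive):
    """CE Mode: evaluate a single queue at the most likely survivor count."""
    return second_wave_value(math.floor((n_aircraft + 1) * p_survive))
```

In this toy setting the sampling estimate converges to the exact expectation as the number of realizations grows, while each certainty equivalence variant makes a single controller call and can be biased because the value function is nonlinear in the survivor count.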

Having presented the prediction accuracy, the question remains whether the higher fidelity
prediction models improve the initial base mission control solution. To perform this assessment,
initial mission queues were generated using all of the 2-wave prediction models presented above.
Again, the MMR assignment algorithm was used to search the control space. To provide a comparison
between proactive and reactive control techniques, mission queues were generated using the
1-stage/1-wave prediction model and the 2-stage/2-wave prediction models. In both cases, the
initial status of the enemy, and thus the initial state x_0, is known. Figure 50 illustrates the
initial mission queues produced by the reactive control technique and a representative proactive
control technique using one of the 2-stage/2-wave prediction models. It is seen from this figure
that the proactive control approach chooses a low-risk route for the initial wave but chooses not
to attack the most distant targets. In contrast, the reactive control solution also chooses
low-risk routes, but chooses to strike targets that require longer ingress and egress times. The
net impact of more distant targets and low-risk routes increases the expected execution time of
the 1-wave control solution by 30% over that of the two-wave control solution. This 30% increase
in expected wave execution time severely limits the target opportunities during the second wave.
Figure 50 Behavioral Comparison of Proactive Versus Reactive Control Strategy for 2-Stage/2-Wave Problem


Having illustrated the behavioral differences between the base mission solutions, the
performance numbers will now be presented. Figure 51 illustrates the empirical performance for
the different mission queues generated using 1-stage and 2-stage prediction models. Note that the
empirical results were obtained using 10,000 Monte Carlo samples. It is seen from this figure
that the performance of the 2-stage control solutions is mixed. From the prediction assessment
above, it is known that the random sampling approach provides an accurate, albeit noisy, estimate
of the expected performance. Thus, this prediction model serves as a baseline to highlight the
achievable performance improvement over the 1-stage approach. Furthermore, the random sampling
approach provides a baseline against which to compare the different 2-wave approaches. It is seen
from this figure that the baseline proactive control approach provides a 93% improvement over the
expected performance of the reactive control approach. Relative to the other 2-wave approaches,
it is seen that the certainty equivalence using the mode is statistically equivalent to the
baseline, the certainty equivalence using the mean is within 14% of the baseline, the aggregate
approach reducing the number of 1st wave air packages is within 45% of the baseline, and the
aggregate approach reducing the size of the wave air packages is within 52% of the baseline.
Thus, the certainty equivalence approaches provide superior performance over the aggregate
approaches. Furthermore, it is believed that the certainty equivalence using the mode provides
superior performance over the certainty equivalence using the mean because the mode amplifies the
differences between the good and the bad control options.

The final piece of the performance story of the different proactive, 2-wave approaches is the
controller complexity for this particular scenario. Figure 52 illustrates the number of one-stage
prediction function calls required to produce a control solution for the different control
approaches. Note that, as highlighted in Section 4.6, a single 2-stage prediction model may
require hundreds of one-stage prediction function calls to produce the estimated performance for
a control option. It is seen from this figure that the baseline solution requires approximately
5.0 orders of magnitude more function calls than the 1-wave control solution. Among the 2-wave
control solutions, the certainty equivalence approaches require approximately 2.0 orders of
magnitude fewer function calls than the baseline, and the aggregate approaches require
approximately 4.5 orders of magnitude fewer function calls.
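As a rough reading of these figures, let n denote the number of one-stage prediction function calls used by the 1-wave control solution (n is a symbol introduced here for illustration; its value is not stated in this section). Under that assumption, the quoted orders of magnitude translate approximately into:

```latex
\begin{aligned}
N_{\mathrm{RS}} &\approx 10^{5.0}\, n, \\
N_{\mathrm{CE}} &\approx 10^{5.0 - 2.0}\, n = 10^{3.0}\, n, \\
N_{\mathrm{AG}} &\approx 10^{5.0 - 4.5}\, n = 10^{0.5}\, n \approx 3.2\, n.
\end{aligned}
```

Here the subscripts RS, CE, and AG refer to the random sampling baseline, the certainty equivalence approaches, and the aggregate approaches, respectively.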
Figure 52 Controller Computational Performance for the 2-Stage/2-Wave Problem

In summary, this assessment highlighted the benefits of performing proactive versus
reactive control over a 2-wave problem. As highlighted in Section 4.6.3, there are a variety of
ways to approximate the loop closure that generates the 2nd wave mission queue, and we have
chosen  to  illustrate  random  sampling,  certainty  equivalence,  and  aggregate  approaches  where  the 
random  sampling  approach  was  used  as  a  baseline  to  compare  prediction  accuracy  and  control 
solution  quality.  From  this  assessment,  it  was  determined  that  proactive  control,  which 
anticipated  the  deadlines  in  the  second  wave  and  determined  wave  missions  to  minimize  risk 
and  execution  time,  provides  a  substantial  performance  improvement  over  non-anticipatory,  i.e. 
reactive,  control  strategies.  Furthermore,  it  was  shown  that  the  certainty  equivalence  control 
approach  does  provide  an  accurate  prediction  of  the  expected  2-wave  performance  and  does 
produce  high-quality  control  solutions  at  a  substantial  reduction  in  computation  complexity  when 
compared  to  the  baseline.  Finally,  it  was  shown  that  the  aggregated  approaches  neither  provide 
an  accurate  prediction  of  the  2-wave  predicted  performance  nor  provide  quality  control  solutions. 

5.5  2-STAGE/1-WAVE MODEL AOR TASKING UNDER UNCERTAINTY PROBLEM

In  the  previous  section,  the  benefits  of  proactive  versus  reactive  control  were  illustrated 
for  a  2-wave  problem  with  stringent  terminal  constraints.  In  this  assessment,  the  complexity  of 
the  problem  will  be  increased  to  include  AOR  tasking  under  uncertainty.  Thus,  this  experiment 
will  highlight  the  benefits  of  anticipating  the  arrival  of  information  and  closing  the  loop  in  the 
context of AOR tasking. Accordingly, the 2-Stage/1-Wave AOR Tasking prediction model
presented  in  Section  4.6.6  will  be  used  for  the  experiment. 

To show this benefit, the standard scenario presented in Section 5.1 was modified by making all
targets of the TCT type and by including AOR points and ISR collection assets. Given that ISR
collection is explicitly modeled in this scenario, perfect state information about the enemy
status is not assumed during execution for this experiment. Note that in previous experiments,
the battlespace state at loop closures was always perfectly observed, and uncertain state
information was only captured within the prediction model for future loop closures. Figure 53
illustrates the modified scenario used for this experiment. It is seen in this figure that the
scenario now includes two AORs and ISR assets that have a priori defined missions to fly from
left to right. Since we only have perfect state information when an ISR asset is within range of
an enemy asset, our knowledge of the enemy state is represented by a distribution at any given
time. This knowledge is represented by the rings in the SAM’s lethal range, where the different
colors represent different modes and the area of each ring represents the probability of being in
a particular state. If an ISR asset is within range of the SAM, the color of the SAM’s range is
solid to reflect perfect observation. Once the ISR asset is out of range of the SAM site, the
status information begins to decay according to a transient Markov process, and eventually
achieves a steady state distribution. Figure 54 illustrates this transient behavior.




Figure  53  Modified  Demonstration  Scenario  Used  for 
AOR  Tasking  Under  Uncertainty  Problem 

Having no information about the enemy status at t=0, it is reasonable to assume that all assets
are in a steady state condition. For this experiment, the steady state distribution for the
targets is P_known = 0.5, P_unknown = 0.5, and P_dead = 0. For the SAMs, the steady state
distribution is P_active = 0.4, P_inactive = 0.35, P_unknown = 0.25, P_repair = 0, and
P_dead = 0.
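The decay of a perfectly observed status toward the steady state distribution can be sketched with a very simple Markov chain. The sketch below is an assumption-laden illustration, not the report's model: it assumes a transition matrix of the form P = (1 - rate) I + rate 1 pi^T, which has the steady state pi by construction, and the value of `rate` is invented. Only the SAM steady state values are taken from the text above.

```python
# SAM status modes and the steady state distribution quoted in the text.
SAM_STATES = ["active", "inactive", "unknown", "repair", "dead"]
SAM_STEADY = [0.4, 0.35, 0.25, 0.0, 0.0]


def decay_step(dist, steady, rate):
    """One step of the assumed chain P = (1 - rate) I + rate * 1 steady^T.

    For any probability vector dist, dist @ P equals
    (1 - rate) * dist + rate * steady.
    """
    return [(1.0 - rate) * d + rate * s for d, s in zip(dist, steady)]


def transient_response(initial, steady, rate, steps):
    """Return the sequence of status distributions after the ISR asset leaves range."""
    history = [list(initial)]
    for _ in range(steps):
        history.append(decay_step(history[-1], steady, rate))
    return history


# Example: a SAM observed as "active" with certainty, then left unobserved.
observed_active = [1.0, 0.0, 0.0, 0.0, 0.0]
history = transient_response(observed_active, SAM_STEADY, rate=0.1, steps=200)
```

With this structure the distribution after t steps is pi + (1 - rate)^t (p_0 - pi), so the knowledge decays geometrically from the observed state to the steady state.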

The  execution  sequence  for  this 
scenario  is  as  follows: 

•  Based on uncertain initial information, it is assumed that the enemy status at t=0, i.e.
x_0, is governed by steady state distributions.

•  Given this information, the controller composes and tasks gorilla air packages to the
AORs.

•  The gorilla air packages launch at t=0 and ingress to their respective AORs using straight
line routes.

•  During ingress, the ISR assets along with the gorilla air packages collect observations
about the enemy status.

•  Upon each gorilla air package’s arrival at its AOR, a controller decomposes it into
smaller air packages and assigns them to particular target locations based on the
information collected.

•  Air packages ingress via straight line routes to their respective target locations, deploy
munitions, and egress back to base.

•  The wave ends when all air packages return to base.

Given  these  modifications  to  the  base  scenario  and  the  execution  sequence  above,  the 
proactive  base  mission  controller  must  anticipate  the  ISR  collection,  information  degradation, 
and the AOR loop closure mapping in order to make optimal gorilla package composition and
tasking decisions. Note that for all experimental results presented in this section, the base mission queues,
i.e.  gorilla  air  package  composition  and  tasking,  are  generated  using  the  Surrogate  Method 
discussed  in  Section  4.5.6.  Likewise,  all  AOR  taskings,  i.e.  decomposition  of  gorilla  air  package 
and  air  package  composition  and  tasking,  are  generated  using  the  Maximum  Marginal  Return 
assignment  algorithm  discussed  in  Section  4.5.3. 
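The Maximum Marginal Return algorithm itself is defined in Section 4.5.3 and is not reproduced here. As a generic illustration of the idea of assigning resources by largest marginal gain, the following Python sketch assumes a much simpler model than the report's: each target j has a value and a single-aircraft kill probability, and assigning n aircraft yields expected value `value * (1 - (1 - p)**n)`. The function names and this reward model are assumptions made for the example.

```python
import heapq


def greedy_marginal_return(values, kill_probs, n_aircraft):
    """Assign aircraft one at a time to the target with the largest marginal gain."""
    if not values:
        return []
    counts = [0] * len(values)
    # Max-heap (via negated gains) of the gain from adding one more aircraft.
    heap = [(-v * p, j) for j, (v, p) in enumerate(zip(values, kill_probs))]
    heapq.heapify(heap)
    for _ in range(n_aircraft):
        _, j = heapq.heappop(heap)
        counts[j] += 1
        v, p = values[j], kill_probs[j]
        # Gain of the next aircraft: the target must survive counts[j] attacks first.
        next_gain = v * (1 - p) ** counts[j] * p
        heapq.heappush(heap, (-next_gain, j))
    return counts


def expected_value(values, kill_probs, counts):
    """Expected destroyed value for a given allocation under the assumed model."""
    return sum(v * (1 - (1 - p) ** n) for v, p, n in zip(values, kill_probs, counts))
```

Because the marginal gain for each target shrinks with every additional aircraft, the greedy rule spreads aircraft across targets rather than concentrating them all on the most valuable one.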
Figure 54 Markov Transient Response of Known Asset Status Distribution

As noted in Section 4.6.6, the difficulty of a 2-stage/1-wave AOR tasking under uncertainty
prediction model is modeling information arrival and degradation and the loop closure upon
arrival at the AOR. This is a difficult problem because there is a combinatorially large number
of possible states upon arrival at the AOR due to gorilla package attrition and information
arrival and degradation. As a further complication, all of the assignment algorithms outlined in
Section 4.3 produce a discrete mission queue for a discrete number of resources; thus, there are
no control algorithms that map a surviving aircraft distribution to a mission queue distribution.
Given these complexities, a variety of AOR tasking prediction models will be assessed in this
section. As in the previous sections, an assessment of the prediction quality and control
performance will be made for each of these approaches.

To perform the prediction accuracy assessment, the 1-wave AOR tasking problem was set up with
the initial gorilla mission queue defined a priori. Thus, the only loop closures occur when the
gorilla air packages arrive at the AORs; upon arrival, the AOR tasking controller determines the
low level air package composition and tasking to target locations. Figure 55 illustrates the
2-stage/1-wave AOR tasking under uncertainty prediction problem, where the initial state includes
the initial mission queue and the enemy status at t=0. The battlespace state is propagated to the
AOR rendezvous point, and based on some approximation, the AOR tasking is modeled. Given the AOR
tasking and the distribution over the enemy status, the battlespace state is propagated to the
end of the 1st wave. Based on the distribution of the battlespace state at the end of the wave,
the performance metric is computed.

Figure 55 Battlespace State Propagation for 2-Stage/1-Wave AOR Tasking Problem with ISR
Collection and Degradation

To perform this assessment, the 2-stage AOR prediction models presented in Section 4.6.6, which
include random sampling, certainty equivalence, and partial OLF, are compared to experimentally
generated expected performance. As an additional comparison, the 1-stage prediction model that
does not anticipate future information arrival and control decisions is included. Figure 56
illustrates the prediction accuracy of the different 2-stage AOR prediction models compared to
empirical evaluation. For the empirical evaluation, 23,500 Monte Carlo evaluations were performed
to generate the empirical performance prediction. It is seen from this figure that the 2-stage
AOR prediction algorithms are consistently optimistic by approximately 15%. As discussed in the
previous section, the random sampling approach should provide an accurate, albeit noisy, estimate
of the expected performance. However, as seen in this figure, the random sampling approach is
producing an optimistic estimate of the expected performance. From a thorough evaluation of the
prediction model, it was determined that there is a mismatch between the prediction model and the
simulator. In the simulator, information degradation begins when the ISR collection asset flies
out of range of the enemy asset; however, in the prediction model, information degradation begins
immediately after detection, i.e. when the asset first comes into range. Thus, the prediction
model is overestimating the transient time for information degradation. Finally, it is seen in
this figure that the 1-stage prediction model, which does not account for information arrival and
subsequent AOR loop closures, is only within 40% of the true expected performance.

Figure 56 Prediction Accuracy of Different Design Models for the 2-Stage/1-Wave AOR Tasking
Under Uncertainty Problem

Having presented the prediction accuracy, the question remains whether the higher fidelity
prediction models improve the initial gorilla air package assignment. To perform this assessment,
initial mission queues were generated using all of the 2-stage AOR prediction models presented
above. To provide a comparison between proactive and reactive control techniques, mission queues
were generated using both the 1-stage/1-wave and 2-stage/1-wave prediction models. Figure 57
illustrates the initial mission queues produced by the proactive and reactive control techniques.
It is seen from this figure that the proactive control approach chooses to send all assets to the
AOR for which information becomes available. This solution is attained by all the 2-stage
controllers, i.e. random sampling, certainty equivalence, and partial OLF. On the other hand, the
1-stage controller, not able to recognize the arrival of information, let alone its benefit,
selects a uniform allocation of resources to AORs. These results are

Figure 57 Behavioral Comparison of Proactive Versus Reactive Control Strategy for 2-Stage/1-Wave
AOR Tasking Under Uncertainty Problem

Having illustrated the behavioral differences between the base mission solutions, the
performance numbers will now be presented. Figure 58 illustrates the empirical performance for
the different mission queues generated using 1-stage and 2-stage prediction models. Note that the
empirical results were obtained using 20,400 Monte Carlo samples. It is seen from this figure
that the performance of the 2-stage control solutions is identical, which is not surprising since
their mission queues were identical. In comparison to the reactive control approach, the
proactive controllers produced a 106% improvement in expected performance.

Figure 58 Control Performance for the 2-Stage/1-Wave AOR Tasking Under Uncertainty Problem

The final piece of the performance story of the different proactive, 2-stage approaches is the
controller complexity for this particular scenario. Figure 59 illustrates the number of one-stage
prediction function calls required to produce a control solution for the different control
approaches. Note that, as highlighted in Section 4.6, a 2-stage prediction model may require
hundreds of one-stage prediction function calls to produce the estimated performance for a
control option. It is seen from this figure that the baseline solution requires approximately 2.5
orders of magnitude more function calls than the 1-stage control solution. Among the 2-stage
control solutions, the certainty equivalence approaches require approximately 1.0 order of
magnitude fewer function calls than the baseline, and the aggregate approaches require
approximately 2.0 orders of magnitude fewer function calls.

Figure 59 Controller Computational Performance for the 2-Stage/1-Wave AOR Tasking Under
Uncertainty Problem

In  summary,  this  assessment  highlighted  the  benefits  of  performing  proactive  versus 
reactive control for the 1-wave AOR tasking under uncertainty problem. It was shown that control
strategies  that  anticipate  future  information  collection,  degradation,  and  loop  closures  can 
provide  a  substantial  performance  improvement  over  control  strategies  that  only  react  to  future 
information. 

6  CONCLUSIONS

This  research,  performed  by  ALPHATECH,  focused  on  providing  military  commanders 
with  the  ability  to  perform  real-time  dynamic  control  of  military  air  operations  using  near 
optimal mission replanning for a 24-hour segment of a JAO campaign using control algorithms
that  anticipate  possible  mission  modifications  due  to  uncertain  future  events.  The  near  real-time, 
near  optimal  control  decisions  being  produced  consist  of  the  generation/modification  of  mission 
definitions  for  both  assets  at  base  and  airborne  with  the  performance  goal  of  achieving  the 
specified  JFACC  objective  while  minimizing  the  friendly  asset  losses.  The  mission  definition 
includes  assignment  of  resources  to  targets,  high-level  routing  (by  specification  of  waypoints), 
strike  package  composition,  weapon  composition,  and  desired  time-on-target.  The  primary 
benefit  of  this  technology  is  agile  and  stable  control  of  distributed  and  dynamic  military 
operations  conducted  in  inherently  uncertain,  hostile,  and  rapidly  changing  environments. 

The  JAO  problem  size  investigated  includes  approximately  100  targets/threats  and  a 
mixture  of  50  airborne  assets  taken  from  two  generic  airborne  asset  types:  strike  and  weasel 
aircraft.  Strike  aircraft  attack  targets  and  weasel  aircraft  strike  surface-to-air  threats.  Risk  to  the 
air packages is introduced via threats such as surface-to-air missiles. The size of operation
we  chose  for  our  study  assumes  that  some  form  of  geographic  decomposition  of  the  battle  space 
has  been  specified,  and  that  we  are  concerned  primarily  with  achieving  the  specified  missions  for 
a  24  hour  period.  Hence,  there  is  implicit  value  in  saving  assets  for  future  operations  beyond  the 
specified  horizon. 

For  the  above  JAO  problem,  there  are  many  interesting  dynamics  that  make  it 
challenging.  The  scarcity  of  aircraft  resources  forces  multiple  “turns”  of  the  aircraft  in  order  to 
service  all  of  the  targets.  There  are  also  multiple  sources  of  uncertainty  in  the  problem.  There  is 
uncertainty  in  the  status  and  location  of  enemy  assets.  Targets  may  be  known,  unknown,  hiding, 
emerging,  time  critical,  or  destroyed.  Threats  may  be  active,  inactive,  unknown,  repairing,  or 
destroyed.  There  is  also  uncertainty  in  the  interaction  between  enemy  and  friendly  assets  that 
depends  on  the  characteristics  of  the  target/threat  and  the  relative  position  and  composition  of  the 
air package, i.e. the number and position of strike and weasel aircraft.

Given  this  highly  uncertain  and  rapidly  changing  environment,  the  JAO  control  problem 
can  be  viewed  as  a  dynamic  decision  problem  under  uncertainty.  This  class  of  problems  can  be 
formulated as a Markov decision problem. Exact techniques for control design using this
approach, such as Stochastic Dynamic Programming (SDP), are computationally expensive, and
do  not  scale  up  to  the  size  of  the  JAO  problem  of  interest.  A  subtle  but  significant  attribute  of 
the  Markov  decision  problem  formulation  is  that  it  produces  control  strategies  that  anticipate  the 
effects  of  future  contingencies,  and  evaluates  the  possible  actions  over  all  possible  future  states, 
by  modeling  the  future  information  arrival  and  control  decisions.  It  is  this  fact  that  produces 
proactive  versus  reactive  control  behaviors.  This  proactive  attribute  is  desirable  for  stable  and 
agile  control  of  the  JAO  enterprise  because  future  information  arrival  and  control  opportunities 
are  dependent  on  stringent  spatial,  temporal,  and  coordination  constraints. 
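The way a Markov decision formulation evaluates actions over all possible future states can be sketched with a finite-horizon backward recursion. The Python sketch below is a generic textbook-style illustration, not the report's controller; the toy model (states are the number of remaining targets, a "strike" succeeds with probability 0.5 at unit cost, and each remaining target costs 10 at the horizon) and all function names are assumptions made for the example.

```python
def backward_induction(states, actions, transition, stage_cost, terminal_cost, horizon):
    """Finite-horizon stochastic dynamic programming over an enumerated state set.

    transition(x, u) returns a list of (probability, next_state) pairs.
    Returns the stage-0 cost-to-go and one decision table per stage.
    """
    value = {x: terminal_cost(x) for x in states}
    policy = []
    for _ in range(horizon):
        new_value, decision = {}, {}
        for x in states:
            best_u, best_q = None, float("inf")
            for u in actions(x):
                # Expected stage cost plus expected cost-to-go of the successor state.
                q = sum(p * (stage_cost(x, u) + value[nxt]) for p, nxt in transition(x, u))
                if q < best_q:
                    best_u, best_q = u, q
            new_value[x], decision[x] = best_q, best_u
        value = new_value
        policy.insert(0, decision)
    return value, policy


# Hypothetical toy model used only to exercise the recursion.
def toy_actions(x):
    return ["hold"] if x == 0 else ["hold", "strike"]


def toy_transition(x, u):
    if u == "strike":
        return [(0.5, x - 1), (0.5, x)]
    return [(1.0, x)]


def toy_stage_cost(x, u):
    return 1.0 if u == "strike" else 0.0


def toy_terminal_cost(x):
    return 10.0 * x
```

The nested loops over stages, states, actions, and successor states show why the exact recursion becomes expensive: the work grows with the size of the state space, which is combinatorially large for the JAO problem.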

Given  the  strengths  and  weaknesses  of  the  SDP  algorithm,  this  research  focused  on 
developing Approximate Dynamic Programming (ADP) strategies that provide the desirable
proactive  control  behaviors  but  with  near  real-time  computation  effort.  The  control  design 
technology  is  based  on  combining  hybrid  state  modeling  techniques  for  developing  statistical 
dynamical  models  relating  mission  decisions  to  evolution  of  objects  in  the  battlespace,  together 
with  ADP  control  design  techniques  that  have  demonstrated  real-time,  proactive  performance  for 
other  relevant  military  problems.  Accordingly,  a  spectrum  of  ADP  control  techniques  were 
developed  for  the  JAO  problem;  these  techniques  were  developed  in  discrete  event  simulations 
of  JAO  scenarios.  The  major  accomplishments  of  the  research  were: 

•  Translated  the  JAO  Control  Enterprise  into  a  Dynamical  Hybrid  State,  Discrete  Event, 
Stochastic  Decision  Making  Problem 

•  Integrated  Emerging  ADP  Technologies  into  JAO  Feedback  Controllers 

•  Experimentally  Demonstrated  the  Benefits  of  Feedback  Control 

•  Experimentally  Demonstrated  Benefits  of  Approximate  Optimal  Control 

•  Developed  Innovative  Hybrid,  Multi-Rate  Control  Architecture 

•  Developed  Computationally  Efficient  Control  Algorithms  that  Produce  Operationally 
Consistent  Behaviors 

•  Extended  Control  Algorithms  to  Accommodate  Hierarchical  Mission  Tasking  and  ISR 
Information  Collection 

In summary, our investigations demonstrated the feasibility of automating military
operations  planning  to  provide  real-time,  near-optimal  control  strategies  that  achieve  operational 
objectives  while  minimizing  asset  losses.  By  adopting  a  hybrid,  multi-rate  control  architecture, 
we  were  able  to  tailor  the  application  of  control  at  a  time  scale  appropriate  to  the  operational 
situation  at  hand.  Within  the  proposed  multi-rate  architecture,  we  developed  a  spectrum  of  ADP 
control strategies that produce a range of control decisions, from immediate retasking or
abort decisions to preplanned multiple wave tasking. The solution quality and computation
performance of these algorithms were tested and verified in a JAO discrete event simulator. Our
experiments  show  that  the  ADP  strategies  were  able  to  produce  operationally  consistent, 
proactive  control  strategies  that  anticipated  likely  contingencies  and  positioned  assets  for 
opportunities  of  recourse  all  in  either  real-time  or  near  real-time.  Furthermore,  a  scalability 
assessment  indicated  that  many  of  the  ADP  controllers  could  provide  near  real-time  performance 
for  scenarios  with  250  targets  with  some  modest  parallel  computation. 

The  results  of  this  investigation  can  be  extended  in  several  important  directions.  First, 
the  algorithms  developed  in  this  investigation  can  be  extended  to  include  further  modeling 
details  of  a  JAO  environment,  such  as  detailed  weaponeering,  additional  platform  types  and 
missions.  In  this  manner,  the  algorithms  could  then  form  the  basis  for  a  decision  aid  for  an  Air 
Operations  Center  (AOC).  This  decision  aid  would  assist  operators  in  rapidly  replanning 
missions  in  the  presence  of  contingencies,  and  help  to  generate  robust  Air  Tasking  Orders 
(ATO).  Second,  the  technology  developed  in  this  work  can  be  extended  to  design  robust 
autonomous  controllers  for  automated  vehicles  conducting  uncertain  missions. 

8  APPENDICES

As noted in previous sections, some of the results that were documented in technical
memoranda and conference proceedings are included to complement the body of this
report.

8.1  TYPES  OF  CONTROL 

See  attached 

Wohletz,  J.M.,  “Optimal  Control  Solutions  for  Stochastic  Systems,”  ALPHATECH  TM-572,  Burlington,  MA,  2000. 

8.2  AEC  2000 

See  attached 

Bertsekas, D.P., D.A. Castañón, et al., “Dynamic Programming Methods for Adaptive Multi-Platform Scheduling
in a Risky Environment,” DARPA AEC Symposium, Minneapolis, MN, July 10-11, 2000.

8.3  AEC  1999 

See  attached 

Bertsekas, D.P., D.A. Castañón, et al., “Adaptive Multi-platform Scheduling in a Risky Environment,”
DARPA-JFACC Advances in Enterprise Control (AEC) Symposium, 1999.

8.4  ACC  2001 

See  attached 

Wohletz, J.M., D.A. Castañón, and M.L. Curry, “Closed-Loop Control for Joint Air Operations,” IEEE American
Control Conference, ACC01-INV3501, Arlington, VA, 2001.


8.5  SPIE  2001 

See  attached 

Cassandras, C.G., K. Gokbayrak, D. Castañón, J. Wohletz, M. Curry, and M. Gates, “Modeling and Agile
Control for a Joint Air Operation Environment,” Proceedings of SPIE 15th Annual Intl. Symposium, April
2001.

TM-572

Optimal  Control  Solutions  for  Stochastic  Systems 

Jerry  M.  Wohletz 

Abstract 

This  technical  memorandum  details  the  distinction  between  the  different  types  of  control  philosophies  that  are 
applicable  to  stochastic  optimal  control  problems.  Stochastic  optimal  control  is  defined  as  the  determination  of  control 
variables  or  parameters  that  minimize  some  well-defined  criterion  subject  to  the  evolution  of  a  stochastic  system  [1].  For 
this  control  problem,  different  optimal  control  policies  result  from  varying  the  assumptions  pertaining  to  the  information 
state  and  from  varying  the  assumptions  pertaining  to  the  optimal  control  structure.  This  paper  focuses  on  distinguishing 
the  differences  between  control  approaches  that  appear  to  be  equivalent,  but  in  fact  produce  different  behaviors.  Given 
the  different  optimal  control  philosophies,  a  few  control  techniques  will  be  explored. 

Keywords 

Stochastic  optimal  control,  approximate  dynamic  programming,  model  predictive  control. 


I.  Basic  Problem 


Consider a stochastic discrete-time dynamical system [2]

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1 \qquad (1)$$


where the state $x_k \in S_k$, the control $u_k \in C_k$, the random disturbance $w_k \in D_k$, and $N$ is the horizon
of interest. The control $u_k$ is constrained to take values in a given nonempty subset $U_k(x_k) \subset C_k$
that depends on the current state $x_k$, i.e. $u_k \in U_k(x_k)\ \forall\, x_k \in S_k$. The random disturbance $w_k$ is
characterized by a probability distribution $P_k(\cdot \mid x_k, u_k)$ that may depend explicitly on $x_k$ and $u_k$ but
not on values of prior disturbances $w_{k-1}, \ldots, w_0$.

For all optimization problems, some well-defined performance metric is required. Here, we use an
additive performance metric in the sense that the cost incurred at time $k$, denoted by $g_k(x_k, u_k, w_k)$,
accumulates over time. Furthermore, we add the possibility of a terminal cost denoted by $g_N(x_N)$.
Because of the presence of the random disturbance $w_k$, the performance metric is a random variable
and the optimization must be performed on the expected cost

$$J(x_0) = E_{\substack{w_k \\ k=0,1,\ldots,N-1}} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right\} \qquad (2)$$

where the expectation is taken with respect to the random disturbances $w_k$, $k = 0, 1, \ldots, N-1$.
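As an illustration only, the expected cost (2) of a fixed control sequence can be estimated by sampling. The sketch below assumes a hypothetical two-stage scalar system, $x_{k+1} = x_k + u_k + w_k$, with stage cost $c_k |u_k|$ and terminal cost $x_N^2$; the disturbance model, the prices `C`, and all function names are assumptions made for this example and do not come from the memorandum.

```python
import random

N = 2                # horizon (assumed)
C = (0.5, 0.75)      # assumed per-stage control prices c_k

def sample_w(k, rng):
    # Assumed toy disturbance: w_0 is 0 or 2 with equal probability, w_1 is always 0.
    return rng.choice((0, 2)) if k == 0 else 0

def simulate_cost(x0, controls, rng):
    """One realization of g_N(x_N) + sum_k g_k(x_k, u_k, w_k)."""
    x, total = x0, 0.0
    for k in range(N):
        u = controls[k]
        total += C[k] * abs(u)          # stage cost g_k
        x = x + u + sample_w(k, rng)    # dynamics (1)
    return total + x * x                # terminal cost g_N

def expected_cost_mc(x0, controls, n_samples=20000, seed=0):
    """Sample average of the random cost, i.e. a Monte Carlo estimate of (2)."""
    rng = random.Random(seed)
    return sum(simulate_cost(x0, controls, rng) for _ in range(n_samples)) / n_samples
```

For this toy model the exact expected cost of the sequence (0, 0) is 2, so the estimate should land close to that value; the sequence (-1, 0) produces the cost 1.5 on every sample.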


II.  Optimal  Control  Formulations 

In  the  basic  problem  formulation  above,  there  is  no  mention  of  the  form  of  the  optimal  control 
solution  or  the  availability  of  future  information;  depending  on  the  availability  of  future  information 
and  the  structure  of  the  control,  fundamentally  different  optimization  problems  result.  Based  on  the 
desired  optimal  behavior,  the  distinction  between  these  different  formulations  is  particularly  relevant 
when  selecting  a  control  design  technique. 

There are three different stochastic optimal control formulations presented in the following subsections.
The assumptions pertaining to the future information availability and structure of the control
are presented along with comments pertaining to implementation. The purpose here is not to identify
solution techniques for the optimization problem, but instead to gain some understanding of the
differences between the optimization formulations.

J.  M.  Wohletz  is  a  Senior  Systems  Engineer  at  ALPHATECH  Inc.,  50  Mall  road,  Burlington,  MA,  01803  USA,  (e-mail: 
jwohletz@alphatech.com) 
A. Open-Loop Optimal Control

The Open-Loop Optimal Control (OLOC) formulation assumes that no future realizations of the
state are measurable; as a result, the control structure is purely a function of time and consists of
a sequence of actions, i.e. $\mathbf{u}_0 = \{u_0, u_1, \ldots, u_{N-1}\}$. Given this assumption, the optimization of (2)
subject to (1) over a sequence of controls $\mathbf{u}_0 = \{u_0, u_1, \ldots, u_{N-1}\}$ based on $x(0)$ is as follows:


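$$\mathbf{u}_0^* = \arg\min_{\mathbf{u}_0 \in U} E_{\substack{w_k \\ k=0,1,\ldots,N-1}} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right\} \qquad (3)$$

subject to the constraints

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad u_k \in U_k(x_k)\ \forall\, x_k \in S_k, \quad k = 0, 1, \ldots, N-1 \qquad (4)$$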


Thus, this optimal control approach produces a control solution that is solely a function of time, and
this solution is not updated as more information becomes available.

It is important to note that the optimal control sequence $\mathbf{u}_0^*$ accounts for the uncertainty
$w_0, w_1, \ldots, w_{N-1}$. If the initial state $x_0$ is uncertain, then the expectation in (3) would be with respect to
$x_0, w_0, w_1, \ldots, w_{N-1}$. Also, for large-scale stochastic problems, this optimal solution is by no means
trivial since the expectation over all disturbances is required.
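A minimal sketch of the OLOC optimization (3) by brute-force enumeration is given here for illustration. It assumes a hypothetical two-stage scalar system, $x_{k+1} = x_k + u_k + w_k$, with five integer controls, stage cost $c_k |u_k|$, and terminal cost $x_N^2$; the model, the numbers, and the function names are assumptions for this example, not part of the memorandum.

```python
from itertools import product

N = 2
CONTROLS = (-2, -1, 0, 1, 2)                     # assumed control set
C = (0.5, 0.75)                                  # assumed control prices c_k
W_DIST = [((0, 0.5), (2, 0.5)), ((0, 1.0),)]     # assumed (value, probability) pairs per stage

def expected_cost(x0, controls):
    """Exact E{ g_N(x_N) + sum_k g_k } for a fixed open-loop control sequence."""
    total = 0.0
    for outcome in product(*W_DIST):             # every disturbance sequence
        x, cost, prob = x0, 0.0, 1.0
        for k, (w, p) in enumerate(outcome):
            cost += C[k] * abs(controls[k])      # stage cost
            x += controls[k] + w                 # dynamics (1)
            prob *= p
        total += prob * (cost + x * x)           # add weighted cost incl. terminal term
    return total

def solve_oloc(x0):
    """Enumerate all control sequences and keep the one with least expected cost."""
    best = min(product(CONTROLS, repeat=N), key=lambda u: expected_cost(x0, u))
    return best, expected_cost(x0, best)
```

On this toy model `solve_oloc(0)` returns the sequence (-1, 0) with expected cost 1.5: the plan spends control effort at the first stage to cancel the mean of $w_0$, because the formulation gives it no opportunity to react later.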


B.  Closed-Loop  Feedback  Optimal  Control 

The Closed-Loop Feedback Optimal Control¹ (CLFOC) formulation assumes that future realizations of the
state are measurable and that the control structure consists of an optimal rule $\mu_k(x_k)$ (a control law)
at each time $k$ that maps all feasible realizations of $x(k)$ to feasible $u(k)$, i.e.
$u_k = \mu_k(x_k) \in U_k(x_k)\ \forall\, x_k \in S_k$. A sequence of control laws corresponding to each time $k$ forms a policy
$\pi = \{u_0, \mu_1(x_1), \ldots, \mu_{N-1}(x_{N-1})\}$. It is important to note the differences between the control structure
here and that of OLOC; a control sequence $\mathbf{u}_0$ consists of a set of 'actions' whereas a control policy $\pi$
consists of a set of 'strategies'. The optimal closed-loop feedback policy $\pi^*$ for some initial state $x_0$ is
obtained as follows:


$$\pi^* = \arg\min_{\pi \in \Pi} E_{\substack{w_k \\ k=0,1,\ldots,N-1}} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right\} \qquad (5)$$

subject to the constraints

$$x_{k+1} = f_k(x_k, \mu_k(x_k), w_k), \quad \pi \in \Pi, \quad k = 0, 1, \ldots, N-1$$

where $\Pi$ is the set of all admissible policies such that $u_k = \mu_k(x_k) \in U_k(x_k)\ \forall\, x_k \in S_k$. If the initial
state $x_0$ is uncertain, then the expectation would be taken with respect to $x_0, w_0, w_1, \ldots, w_{N-1}$, and the
resulting policy would be $\pi = \{\mu_0(x_0), \ldots, \mu_{N-1}(x_{N-1})\}$.

An  important  distinction  between  the  CLFOC  and  OLOC  formulation  is  that  the  CLFOC  problem 
explicitly  accounts  for  the  feedback  mechanism  in  the  mathematical  formulation  by  modeling  future 
information  arrival  and  selecting  optimal  control  decisions  which  depend  on  the  future  information. 
Thus, the solution of the CLFOC problem provides an optimal mapping for all feasible future realizations
of the state to control decisions. In comparison, the solution of the OLOC problem produces a
sequence  of  optimal  control  decisions  that  are  implemented  regardless  of  the  future  realization  of  the 
state. 
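For a small problem, the CLFOC policy in (5) can be computed by backward stochastic dynamic programming. The sketch below is illustrative only and assumes a hypothetical two-stage scalar system, $x_{k+1} = x_k + u_k + w_k$, with stage cost $c_k |u_k|$ and terminal cost $x_N^2$; the model, the numbers, and the name `cost_to_go` are assumptions for this example.

```python
from functools import lru_cache

N = 2
CONTROLS = (-2, -1, 0, 1, 2)                     # assumed control set
C = (0.5, 0.75)                                  # assumed control prices c_k
W_DIST = [((0, 0.5), (2, 0.5)), ((0, 1.0),)]     # assumed (value, probability) pairs per stage

@lru_cache(maxsize=None)
def cost_to_go(k, x):
    """Return (J_k(x), mu_k(x)): optimal expected cost-to-go and a minimizing control."""
    if k == N:
        return float(x * x), None                # terminal cost g_N
    best_value, best_u = float("inf"), None
    for u in CONTROLS:
        # Expectation over w_k of stage cost plus optimal cost-to-go of the next state.
        value = sum(p * (C[k] * abs(u) + cost_to_go(k + 1, x + u + w)[0])
                    for w, p in W_DIST[k])
        if value < best_value:
            best_value, best_u = value, u
    return best_value, best_u
```

Here `cost_to_go(0, 0)` evaluates to (0.75, 0): the optimal strategy waits at the first stage and corrects at the second stage only if the disturbance actually occurs. The second-stage rule, for example `cost_to_go(1, 2)[1] == -2`, is a function of the realized state rather than a fixed action.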

¹ Dreyfus [1] refers to this approach as Feedback Optimal Control (FOC).
C. Open-Loop Optimal Feedback Control

Open-Loop Optimal Feedback Control (OLOFC) is a term originally coined by Dreyfus [1], and as
the name implies, is similar to OLOC in its mathematical formulation but differs in implementation.
Like the OLOC formulation, the OLOFC formulation assumes that no future realizations of the state
will be measured. As a result, the control structure is purely a function of time and consists of a
sequence of actions, i.e. $\mathbf{u}_0 = \{u_0, u_1, \ldots, u_{N-1}\}$. However, unlike OLOC, only the initial optimal
control $\mathbf{u}_0^*(0) = u_0^*$ is applied, and the optimization problem is re-solved for $\mathbf{u}_1^*$ at time $k = 1$ based on
the actual, rather than expected, realization of the state $x(1)$. Then, the optimal control $\mathbf{u}_1^*(0) = u_1^*$ is applied,
and the control process repeats until the time $k = N$. Note that, as time progresses, the optimization
horizon shrinks, i.e. at time $n$, the planning horizon is $k = n, n+1, \ldots, N-1$.

Since the arrival of future information is withheld from the optimization problem, the optimization
is identical to the OLOC formulation; the optimization of (2) subject to (1) over a sequence of controls
$\mathbf{u}_0 = \{u_0, u_1, \ldots, u_{N-1}\}$ based on $x_0$ is as follows:

$$\mathbf{u}_0^* = \arg\min_{\mathbf{u}_0 \in U} E_{\substack{w_k \\ k=0,1,\ldots,N-1}} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right\} \qquad (6)$$

subject to the constraints

$$x_{k+1} = f_k(x_k, u_k, w_k), \quad u_k \in U_k(x_k)\ \forall\, x_k \in S_k, \quad k = 0, 1, \ldots, N-1 \qquad (7)$$


Again,  this  optimal  control  approach  produces  a  control  solution  that  is  solely  a  function  of  time. 

Having stated the similarities and differences between OLOFC and OLOC, we will now focus on
the similarities and differences between OLOFC and CLFOC. In both approaches feedback is used
in implementation; however, CLFOC explicitly models the feedback mechanism in its mathematical
formulation, whereas OLOFC neglects the feedback mechanism in its mathematical formulation. By
neglecting the feedback mechanism, the OLOFC problem is simpler to solve. However, what is gained in
reducing complexity may be lost in the quality of the solution. By neglecting the feedback mechanism,
i.e. future information arrival and subsequent control decisions, the anticipatory nature of CLFOC
is lost in the OLOFC formulation; as a result, OLOFC is reactive rather than proactive to future
realizations of the state. This attribute can have significant impact for systems in which there are
significant time delays between arrivals of future information. To illustrate these differences, Dreyfus [1]
presents solutions to a stochastic optimization problem using the above three formulations.
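The OLOFC implementation can be sketched as a loop that re-solves a shrinking-horizon open-loop problem at every stage and applies only the first action. The example below is a hedged illustration on a hypothetical two-stage scalar system, $x_{k+1} = x_k + u_k + w_k$, with stage cost $c_k |u_k|$ and terminal cost $x_N^2$; the model, the numbers, and the function names are assumptions and are not taken from Dreyfus [1] or from this memorandum.

```python
from itertools import product

N = 2
CONTROLS = (-2, -1, 0, 1, 2)                     # assumed control set
C = (0.5, 0.75)                                  # assumed control prices c_k
W_DIST = [((0, 0.5), (2, 0.5)), ((0, 1.0),)]     # assumed (value, probability) pairs per stage

def open_loop_cost(k, x, controls):
    """Exact expected cost of applying `controls` blindly from stage k and state x."""
    if k == N:
        return float(x * x)
    u = controls[0]
    return sum(p * (C[k] * abs(u) + open_loop_cost(k + 1, x + u + w, controls[1:]))
               for w, p in W_DIST[k])

def first_open_loop_control(k, x):
    """Solve the shrinking-horizon open-loop problem; return only its first action."""
    plan = min(product(CONTROLS, repeat=N - k),
               key=lambda seq: open_loop_cost(k, x, seq))
    return plan[0]

def olofc_cost(k, x):
    """Exact expected cost when the open-loop problem is re-solved at every stage."""
    if k == N:
        return float(x * x)
    u = first_open_loop_control(k, x)
    return sum(p * (C[k] * abs(u) + olofc_cost(k + 1, x + u + w))
               for w, p in W_DIST[k])
```

On this toy model the pure open-loop plan (-1, 0) costs 1.5 in expectation, while `olofc_cost(0, 0)` is 1.25 because the second action is corrected after the state is observed. The first action is still -1, chosen without anticipating that correction; a closed-loop feedback solution of the same toy model would wait at the first stage and achieve 0.75.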

III.  Control  Techniques 

Given the stochastic nature of the plant in (1), the CLFOC solution will achieve the best performance.
If the plant is deterministic, all three optimization formulations would produce the same
performance, since feedback is only required if there is uncertainty. A control technique that solves the
CLFOC problem is Stochastic Dynamic Programming (SDP). However, this technique is only tractable
for small problems. In general, one must resort to approximation techniques. In this section, two
different approximate optimal control techniques, in the context of the above formulations, will be
presented. The first technique, Rollout, is an approximation to CLFOC, and the second technique,
Model Predictive Control, is an approximation to OLOFC. For both techniques, the implication of the
approximations on the solution is discussed.

A.  Rollout 

The Rollout Algorithm (RA) [3], [4] is an approximate CLFOC approach that determines a suboptimal
policy on-line based on the actual realization of the state $x(k)$ at time $k$. Like CLFOC, the
RA assumes that future realizations of the state are measurable; however, the control strategies to be
used in selecting future decisions are known, consisting of a suboptimal rule $\bar{\mu}(x_k)$ at each future
time $k$ that maps all feasible realizations of $x(k)$ to feasible $u(k)$, i.e. $u_k = \bar{\mu}(x_k) \in U_k(x_k)\ \forall\, x_k \in S_k$.
As a result, the RA policy has the following form: $\pi_0^{RA} = \{u_0, \bar{\mu}(x_1), \ldots, \bar{\mu}(x_{N-1})\}$. Like OLOFC,
only the initial optimal control $\pi_0^{RA*}(0) = u_0^*$ is applied, and the optimization problem is re-solved for
$\pi_1^{RA} = \{u_1, \bar{\mu}(x_2), \ldots, \bar{\mu}(x_{N-1})\}$ at time $k = 1$ based on the actual realization of the state $x_1$. Then,
the optimal control $\pi_1^{RA*}(0) = u_1^*$ is applied, and the control process repeats until the time $k = N$.
As before, the optimization horizon shrinks as time progresses, i.e. at time $n$, the planning horizon is
$k = n, n+1, \ldots, N-1$.

Since the feedback mechanism (future information arrival and control decisions) is explicitly
modeled in this formulation, the optimization problem is identical to the CLFOC with the exception
that optimization over future strategies $\pi$ is replaced with the known strategy $\pi^{RA}$, i.e. $\mu_k(x_k)$ is
approximated by a known suboptimal baseline strategy $\bar{\mu}(x_k)$. Given that the baseline strategy is
known, the optimization over all admissible $\pi_0^{RA}$ is equivalent to selecting $u_0$; thus, based on the initial
condition $x_0$, the RA optimization problem is as follows:

$$u_0^* = \arg\min_{u_0 \in U_0(x_0)} E_{\substack{w_k \\ k=0,1,\ldots,N-1}} \left\{ g_N(x_N) + g_0(x_0, u_0, w_0) + \sum_{k=1}^{N-1} g_k(x_k, \bar{\mu}(x_k), w_k) \right\}$$

subject to the constraints

$$x_1 = f_0(x_0, u_0, w_0), \quad u_0 \in U_0(x_0)$$

$$x_{k+1} = f_k(x_k, \bar{\mu}(x_k), w_k), \quad \bar{\mu}(x_k) = u_k \in U_k(x_k)\ \forall\, x_k \in S_k, \quad k = 1, \ldots, N-1$$

The  RA  is  related  to  the  CLFOC  formulation  since  it  accounts  for  future  information  arrival  and  control 
decisions.  As  a  result,  RA  based  controllers  will  exhibit  anticipatory  behavior.  However,  the  RA  is  not 
as  ambitious  as  the  CLFOC  formulation,  and  only  provides  modest  guarantees  of  near-optimality  [4]. 

Implementing the RA requires computing the expectation with respect to the random disturbances
$w_0, w_1, \ldots, w_{N-1}$ by modeling future information arrival and the resulting control decisions via the
baseline strategy. An approach for computing this expectation at a given state $x_k$ and time $k$ is to
use Monte Carlo simulations. To implement this approach, we consider all possible controls
$u_k \in U_k(x_k)$ and generate a “large” number of simulation trajectories starting from $x_k$. Thus, the simulated
trajectory has the form:


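$$g_N(x_N) + g_k(x_k, u_k, w_k) + \sum_{i=k+1}^{N-1} g_i(x_i, \bar{\mu}(x_i), w_i) \qquad (8)$$

$$x_{k+1} = f_k(x_k, u_k, w_k)$$

$$x_{i+1} = f_i(x_i, \bar{\mu}(x_i), w_i), \quad i = k+1, \ldots, N-1$$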
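The Monte Carlo procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the two-stage scalar system $x_{k+1} = x_k + u_k + w_k$, the cost terms, the baseline rule `base_policy` (one unit step toward the origin), and the use of a common random seed for all candidate controls are all hypothetical choices made for the example.

```python
import random

N = 2
CONTROLS = (-2, -1, 0, 1, 2)     # assumed control set
C = (0.5, 0.75)                  # assumed control prices c_k

def sample_w(k, rng):
    # Assumed toy disturbance: w_0 is 0 or 2 with equal probability, w_1 is always 0.
    return rng.choice((0, 2)) if k == 0 else 0

def base_policy(x):
    """Assumed baseline heuristic: one unit step toward the origin."""
    return (x < 0) - (x > 0)

def simulate(k, x, u, rng):
    """Cost of one trajectory: apply u at stage k, then follow the baseline rule."""
    total = 0.0
    for i in range(k, N):
        total += C[i] * abs(u)           # stage cost of the control actually applied
        x = x + u + sample_w(i, rng)     # state transition
        u = base_policy(x)               # future decisions come from the baseline rule
    return total + x * x                 # terminal cost

def rollout_control(k, x, n_samples=2000, seed=0):
    """Pick the control whose Monte Carlo estimate of expected cost is smallest."""
    def q_estimate(u):
        rng = random.Random(seed)        # common random numbers across candidates
        return sum(simulate(k, x, u, rng) for _ in range(n_samples)) / n_samples
    return min(CONTROLS, key=q_estimate)
```

On this toy model `rollout_control(0, 0)` selects 0, because the simulated trajectories account for the corrective action the baseline rule takes once the disturbance is observed. At the last stage, `rollout_control(1, 2)` selects -2, a better action than the baseline's -1.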


B.  Model  Predictive  Control 

Model Predictive Control (MPC) [5], or Receding Horizon Control (RHC), is an approximate
OLOFC technique that solves a deterministic optimization problem. Given that this technical
memorandum focuses on stochastic optimal control, the MPC technique requires an approximation to the
expected cost function in (6) that is parameterized by the control sequence $\mathbf{u}_0$. One common
approach is to assume that a certainty equivalence principle holds and that the random disturbances
$w_0, w_1, \ldots, w_{N-1}$ in (1) can be replaced by their expected values $\bar{w}_0, \bar{w}_1, \ldots, \bar{w}_{N-1}$. Of course, an
underlying assumption here is that the random disturbances $w_k$ do not depend on the state $x_k$ and the
control $u_k$. Other approaches can be used to approximate the expected cost function in (6).

Like the OLOFC formulation, the MPC formulation assumes that no future realizations of the state
will be measured. As a result, the control structure is purely a function of time and consists of a sequence
of actions, i.e. $\mathbf{u}_0 = \{u_0, u_1, \ldots, u_{N-1}\}$. Additionally, the current sequence of optimal control actions
is determined on-line at each time $k$ using the current state of the plant $x_k$, and only the first
control in this sequence, $\mathbf{u}_k^*(0) = u_k^*$, is applied to the plant. The MPC optimization problem where it
is assumed that the certainty equivalence principle holds is as follows:


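$$\mathbf{u}_0^* = \arg\min_{\mathbf{u}_0 \in U} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, \bar{w}_k) \right\} \qquad (10)$$

subject to the constraints

$$x_{k+1} = f_k(x_k, u_k, \bar{w}_k), \quad u_k \in U_k(x_k)\ \forall\, x_k \in S_k, \quad k = 0, 1, \ldots, N-1 \qquad (11)$$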

A few notes about MPC are warranted. First, because MPC is based on an OLOFC formulation and
does not model the feedback mechanism, the MPC solution can only react to future realizations of the
state. As Mayne states [5], “A defect of model predictive control of uncertain systems, not yet widely
appreciated, is the open-loop nature of the optimal control problem.” Secondly, modeling a stochastic
system as deterministic raises the question of robustness, i.e. the maintenance of certain properties
such as stability and performance in the presence of the uncertainty. The inherent robustness properties
of the MPC formulation have been studied for systems that are linear with respect to the input $u$ and
that only have terminal state constraints. In an attempt to add uncertainty to the mathematical
formulation, an Open-Loop Min-Max Model Predictive Control formulation has been proposed [5].
However, the solutions are very conservative due to the open-loop nature of the optimization. As
Mayne states [5], “... the scenario generated in solving { the open-loop min-max model predictive
control problem } does not model accurately the uncertain control problem because it ignores feedback
by searching over open-loop control sequences { $\mathbf{u}_k$ } in minimizing { the cost function }.” Because
of this deficiency, Mayne recommends a Feedback Min-Max Model Predictive Control formulation
that replaces the control sequence $\mathbf{u}_k$ with a control policy $\pi_k$ similar to the ones presented above. As
Mayne notes, “The feedback version of the model predictive control appears attractive but prohibitively
complex.”
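A certainty-equivalent MPC loop of the kind described above can be sketched as follows. The example is illustrative and rests on assumptions: the two-stage scalar system $x_{k+1} = x_k + u_k + w_k$, the cost terms, the disturbance distribution, and the function names are hypothetical and are not taken from [5] or from this memorandum.

```python
from itertools import product

N = 2
CONTROLS = (-2, -1, 0, 1, 2)                     # assumed control set
C = (0.5, 0.75)                                  # assumed control prices c_k
W_DIST = [((0, 0.5), (2, 0.5)), ((0, 1.0),)]     # assumed (value, probability) pairs per stage
W_MEAN = [sum(w * p for w, p in dist) for dist in W_DIST]   # expected disturbances

def nominal_cost(k, x, controls):
    """Deterministic cost (10) with every w_k replaced by its expected value."""
    cost = 0.0
    for i, u in enumerate(controls, start=k):
        cost += C[i] * abs(u)
        x = x + u + W_MEAN[i]
    return cost + x * x

def mpc_control(k, x):
    """Solve the deterministic problem from (k, x); apply only the first action."""
    plan = min(product(CONTROLS, repeat=N - k),
               key=lambda seq: nominal_cost(k, x, seq))
    return plan[0]

def mpc_expected_cost(k, x):
    """Exact expected cost of running the MPC loop against the true disturbances."""
    if k == N:
        return float(x * x)
    u = mpc_control(k, x)
    return sum(p * (C[k] * abs(u) + mpc_expected_cost(k + 1, x + u + w))
               for w, p in W_DIST[k])
```

On this toy model the deterministic plan predicts a cost of 0.5, whereas the expected cost actually incurred by the loop is 1.25. The gap reflects the fact that the planning model contains neither the disturbance variance nor the later corrective actions.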


IV.  Conclusions 

This technical memorandum presented three different types of control philosophies that are
applicable to stochastic optimal control problems. The different optimal control formulations result from
varying the assumptions on the future information availability and on the optimal control structure.
The Open-Loop Optimal Control and Open-Loop Optimal Feedback Control formulations assume, in the
mathematical optimization formulation, that no future realizations of the state are measurable. The
distinction between the two resides in their implementations. In the Open-Loop Optimal
Control formulation, the optimal control sequence generated at time $t = 0$ is not updated. In the
Open-Loop Optimal Feedback Control formulation, only the initial control in the optimal control
sequence is implemented and the open-loop optimization problem is re-solved at the next realization of
the state. In contrast to these two approaches, the Closed-Loop Feedback Optimal Control formulation
explicitly models the feedback mechanism, i.e. the dependence of future control decisions on future
information arrival, in the mathematical formulation. The fundamental distinction between the
open-loop optimizations and the closed-loop optimization is that the open-loop optimizations produce a set of
“actions” whereas the closed-loop feedback optimization produces a set of “strategies”. By including
the feedback mechanism in the mathematical formulation, the closed-loop feedback optimization is
proactive rather than reactive and is able to anticipate likely contingencies; in stochastic scenarios this
trait guarantees performance at least as good as that of the open-loop formulations.

In addition to making the distinctions between different control philosophies, two approximate
stochastic optimal control techniques were discussed. The Rollout Algorithm approximates the
Closed-Loop Feedback Optimal Control formulation, and has a similar implementation to the Open-Loop
Optimal Feedback Control. The key approximation in the Rollout Algorithm is that future
decisions in the feedback mechanism are modeled by a known suboptimal rule. Next, the Model Predictive
Control technique approximates the Open-Loop Optimal Feedback Control formulation. The key point
here is that the Open-Loop Optimal expected cost function is approximated so that a deterministic
optimization problem is solved.
Abstract

In  this  paper,  we  investigate  alternatives  to  simulation- 
based  approximate  dynamic  programming  methods  for 
adaptive  multi-platform  scheduling  in  a  risky 
environment.  In  a  recent  effort,  we  considered  rollout 
algorithms,  in  which  on-line  simulation  was  found  to  be 
more  reliable  than  off-line  training.  Unfortunately,  a 
large  amount  of  computational  resources  was  required  to 
run  even  a  modest  number  of  Monte  Carlo  simulations. 
In  this  paper,  we  consider  alternatives  to  using 
simulation.  The  first  approach  consists  of  using  limited 
lookahead  policies,  which  reduce  computational 
requirements  by  considering  value  explicitly  over  a 
limited  horizon  and  approximating  the  value  of  the 
remaining  stages.  The  second  approach  decomposes  the 
problem  into  sub-problems  corresponding  to  platforms. 
In  our  computational  experiments,  we  found  that  many  of 
the  variations  of  these  approaches  required  significantly 
less  computation  time  than  rollout  algorithms  and  also 
obtained  results  that  were  substantially  superior. 

1.  Introduction 

The  planning  and  execution  of  multiple  missions  in  the 
presence  of  risk  is  a  problem  that  arises  in  many  important 
military  contexts.  In  data  collection  applications,  multiple 
UAV  platforms  may  be  tasked  to  interrogate  different 
areas,  with  the  risk  of  platform  destruction  as  each 
platform  pursues  its  collection  mission.  In  attack  air 
operations,  multiple  platforms  follow  risky  trajectories  to 


attack  enemy  targets.  For  both  applications,  sensors  and 
communication  equipment  can  provide  up-to-date 
information  concerning  individual  mission  and  platform 
status,  and  thus  provide  notification  of  platform  losses. 
This  creates  opportunities  for  replanning,  using  feedback 
to  retask  surviving  platforms  in  order  to  best  achieve 
mission  objectives. 

In  mathematical  terms,  the  above  class  of  problems 
can  be  formulated  as  Markov  decision  processes.  At  each 
stage  of  the  process,  decisions  are  made  that  affect  the 
evolution  of  a  system  state,  which  is  also  influenced  by 
random  discrete  events.  The  goal  is  to  select  the  current 
decision  as  a  function  of  the  current  state  in  order  to 
optimize  mission  performance. 

The principal approach for solving Markov decision
problems is dynamic programming (DP). In comparing
the available controls at a given state $i$, DP considers the
current stage value, but also takes into account the
desirability of the next state $j$. It “ranks” different states $j$
by using, in addition to the current stage value, the
optimal value (over all remaining stages) starting from $j$.
This optimal value is denoted $J^*(j)$ and referred to as the
optimal value-to-go of $j$. Unfortunately, it is well known
that the computation of $J^*$ is overwhelming for many
important problems.

There has been a great deal of research on DP methods
that replace the optimal value-to-go $J^*(j)$ with a suitable
approximation for the purpose of comparing the available
controls at each state. These methods are collectively
known as neuro-dynamic programming (NDP).
Previously, we applied a particular class of NDP
algorithms, known as rollout algorithms, to risky
multi-platform planning and scheduling problems. Rollout
algorithms  are  a  form  of  NDP  that  exploit  knowledge  of 
suboptimal  heuristic  decision  rules  to  obtain 
approximations  to  the  optimal  value-to-go.  We  developed 
several  rollout  algorithms  for  risky  multi-platform 
scheduling,  using  on-line  Monte  Carlo  simulations  to 
evaluate  the  reference  base  heuristic  policies,  and  found 
that  they  performed  significantly  better  than  the  base 
policies  as  well  as  off-line  training  methods.  However, 
even  using  a  modest  number  of  Monte  Carlo  simulations 
resulted  in  large  computation  times. 

In  this  paper,  we  consider  alternatives  to  using  on-line 
simulations.  In  particular,  we  consider  two  approaches 
that  use  analytic  approximations  of  the  value  function.  We 
first  consider  a  class  of  approximation  techniques  in 
which the control exercised at a state $i$ is determined by
considering  the  costs  accumulated  over  several  stages,  and 
then  applying  an  approximation  to  the  value-to-go  from 
the  resulting  states.  The  rollout  algorithms  considered  in 
our  previous  effort  are  a  special  case  in  which  a  single- 
stage  policy  is  employed  and  on-line  simulation  is  used  in 
combination  with  a  base  heuristic  to  approximate  the 
value-to-go. 

Our  second  approach  involves  exploiting  the  structure 
of the problem and decomposing the problem into
sub-problems, each of which is associated with a
corresponding  platform.  Each  sub-problem  is  solved 
independently  but  takes  into  account  the  results  of 
previously  solved  sub-problems. 

The  paper  is  organized  as  follows.  In  Section  2,  we 
describe  the  data  collection  problem  which  we  are 
addressing.  In  Section  3,  we  present  the  framework  for 
limited  lookahead  policies.  In  Section  4,  we  describe  our 
decomposition  approach  to  the  problem.  In  Section  5,  we 
present  some  computational  results. 

2.  Example  Data  Collection  Problem 

The  graph  in  Figure  1  is  an  example  corresponding  to 
a  data  collection  problem.  Each  node  represents  a 
geographical  area  of  interest  with  a  one-time  value  (i.e., 
data  may  only  be  collected  once  from  each  location).  The 
arcs  represent  connectivity  among  the  geographical 
regions  and  may  be  successfully  traversed  with  a  known 
probability.  Platforms  traverse  the  graph  and  collect  data 
(value)  at  each  node,  or  else  they  are  destroyed  while 
traversing  specific  arcs.  If  a  platform  is  destroyed  on  an 
arc,  the  value  of  the  destination  node  is  not  collected, 
which  can  result  in  retasking  other  platforms. 


Figure  1  Graph  Representation  of  the  data 
collection  problem. 

The  objective  is  to  control  the  platforms  in  order  to 
maximize  the  expected  total  value  collected  after  N  stages. 
Each  platform  begins  at  a  base  node  (in  this  case,  node  0 
for  all  platforms)  and  may  traverse  one  arc  during  each 
stage.  There  is  a  reward  for  each  platform  that  has  safely 
returned to its base node at the end of the Nth stage.

3.  Limited  Lookahead  Policies 

Consider a discrete-time dynamic system,

$$x_{k+1} = f_k(x_k, u_k, \omega_k),$$

where $x_k$ is the state, $u_k$ is the control to be selected from
a finite set $U_k(x_k)$, and $\omega_k$ is a random disturbance.
Denote the single-stage reward of control $u$ from state $x$
and disturbance $\omega$ by $g_k(x, u, \omega)$. A control policy
$\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ maps, for each stage $k$, a state $x_k$ to
a control value $\mu_k(x_k) \in U_k(x_k)$. There is a terminal
reward $G(x_N)$ that depends on the terminal state $x_N$. The
value-to-go of an optimal policy
starting from a state $x_k$ at stage $k$ can be computed using
the following DP recursion

$$J_k(x_k) = \max_{u_k \in U_k(x_k)} E\left\{ g_k(x_k, u_k, \omega_k) + J_{k+1}\big(f_k(x_k, u_k, \omega_k)\big) \right\},$$

for all $k$ and with the initial condition

$$J_N(x_N) = G(x_N).$$
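The DP recursion can be illustrated on a very small instance of the data collection problem. The sketch below assumes a hypothetical single-platform, three-node graph with made-up arc survival probabilities, node values, and return reward, and it assumes that remaining at a node is risk-free; none of these numbers or names come from the paper.

```python
from functools import lru_cache

N = 3                                                  # number of stages (assumed)
NODES = (0, 1, 2)                                      # node 0 is the base
SURVIVAL = {(0, 1): 0.75, (0, 2): 0.5, (1, 2): 0.75}   # assumed arc survival probabilities
VALUE = {0: 0.0, 1: 2.0, 2: 4.0}                       # assumed one-time node values
RETURN_REWARD = 1.0                                    # assumed terminal reward at the base

def survival(i, j):
    """Probability of surviving the move i -> j; staying put is assumed risk-free."""
    return 1.0 if i == j else SURVIVAL[(min(i, j), max(i, j))]

@lru_cache(maxsize=None)
def value_to_go(k, node, collected):
    """J_k for the state (node, collected); returns (value, best next node)."""
    if k == N:
        return (RETURN_REWARD if node == 0 else 0.0), None
    best = (float("-inf"), None)
    for nxt in NODES:
        gain = 0.0 if nxt in collected else VALUE[nxt]
        future = value_to_go(k + 1, nxt, collected | {nxt})[0]
        # A destroyed platform collects nothing further, so that branch contributes 0.
        q = survival(node, nxt) * (gain + future)
        if q > best[0]:
            best = (q, nxt)
    return best
```

Calling `value_to_go(0, 0, frozenset({0}))` on this toy graph gives (4.03125, 1): the optimal first move is along the safer arc to node 1, followed by node 2 and a return to base.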

For  our  problem,  the  state  can  be  represented  by  a 
vector  indicating  for  each  node  whether  or  not  its  value 
has  been  collected  and  by  another  vector  indicating  for 
each  platform  whether  or  not  it  is  alive  and  if  so,  the  node 


at  which  the  platform  is  located.  The  control  at  a 
particular  stage  provides  for  each  platform  that  is  alive  a 
node  that  the  platform  is  to  attempt  to  visit  during  the 
current  stage.  If  the  platform  successfully  traverses  the 
arc  connecting  its  current  node  to  the  next  node  and  the 
value  of  the  node  has  not  yet  been  collected,  the  current 
stage  reward  includes  the  value  of  the  node.  If  the 
platform  successfully  reaches  its  base  node  during  the  last 
stage,  there  is  a  terminal  reward  associated  with  the 
platform. 

Under a one-step lookahead policy, the control
selected at stage $k$ and state $x_k$ is that which maximizes
the following expression:

$$\max_{u_k \in U_k(x_k)} E\left\{ g_k(x_k, u_k, \omega_k) + \tilde{J}_{k+1}\big(f_k(x_k, u_k, \omega_k)\big) \right\},$$

where $\tilde{J}_{k+1}$ is some approximation of the value-to-go
function $J_{k+1}$. Under a two-step lookahead policy, the
control selected at stage $k$ and state $x_k$ is that which
maximizes the above expression when $\tilde{J}_{k+1}$ is itself a
one-step lookahead approximation; i.e., for all possible states
$x_{k+1} = f_k(x_k, u_k, \omega_k)$, we have

$$\tilde{J}_{k+1}(x_{k+1}) = \max_{u_{k+1} \in U_{k+1}(x_{k+1})} E\left\{ g_{k+1}(x_{k+1}, u_{k+1}, \omega_{k+1}) + \tilde{J}_{k+2}\big(f_{k+1}(x_{k+1}, u_{k+1}, \omega_{k+1})\big) \right\}.$$

Other multi-stage lookahead policies are similarly defined.
Note that the number of lookahead stages, $M$, should be
less than or equal to $N-k-1$. Essentially, the $M$-stage
lookahead policy selects at stage $k$ its decision by
determining the optimal policy if there were only $M$ stages
remaining and the terminal cost was given by
$E_{x_M}\{\tilde{J}_{k+M+1}(x_M)\}$, where $x_M$ is the state resulting from

applying  the  policy  for  the  M  decisions.  A  decision  is 
selected,  and  the  process  is  repeated  at  the  next  stage. 
The  lookahead  horizon  is  limited  to  the  number  of 
remaining  stages,  and  so  if  the  number  of  remaining  stages 
is  less  than  M,  the  M-stage  lookahead  policy  determines 
the  optimal  strategy.  A  special  case  of  such  policies  in 
which  the  value-to-go  is  approximated  with  zero  is 
referred  to  in  the  literature  as  rolling  or  receding  horizon 
procedures. 
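A limited lookahead policy can be sketched as a depth-limited version of the DP recursion in which the value-to-go beyond the lookahead horizon is supplied by an approximation. The example below is illustrative only: it assumes a hypothetical single-platform, three-node graph with made-up survival probabilities and values, it counts `steps` so that `steps=1` is the one-step lookahead policy, and it uses a zero approximation (the rolling horizon special case).

```python
N = 3                                                  # number of stages (assumed)
NODES = (0, 1, 2)                                      # node 0 is the base
SURVIVAL = {(0, 1): 0.75, (0, 2): 0.5, (1, 2): 0.75}   # assumed arc survival probabilities
VALUE = {0: 0.0, 1: 2.0, 2: 4.0}                       # assumed one-time node values
RETURN_REWARD = 1.0                                    # assumed terminal reward at the base

def survival(i, j):
    """Probability of surviving the move i -> j; staying put is assumed risk-free."""
    return 1.0 if i == j else SURVIVAL[(min(i, j), max(i, j))]

def lookahead(k, node, collected, steps, approx):
    """Return (value, best next node) of a `steps`-step lookahead from stage k."""
    if k == N:                                   # true end of the problem
        return (RETURN_REWARD if node == 0 else 0.0), None
    if steps == 0:                               # end of the lookahead horizon
        return approx(k, node, collected), None
    best = (float("-inf"), None)
    for nxt in NODES:
        gain = 0.0 if nxt in collected else VALUE[nxt]
        future = lookahead(k + 1, nxt, collected | {nxt}, steps - 1, approx)[0]
        q = survival(node, nxt) * (gain + future)   # destroyed branch contributes 0
        if q > best[0]:
            best = (q, nxt)
    return best

def zero_approx(k, node, collected):
    """Rolling-horizon special case: value-to-go approximated with zero."""
    return 0.0
```

From the base with nothing collected, the one-step policy greedily heads for the high-value node 2 across the riskier arc, whereas two or more steps of lookahead choose node 1 first. With `steps=3` the lookahead spans all remaining stages and reproduces the optimal value 4.03125 for this toy graph.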

Generally,  the  effectiveness  of  limited  lookahead 
policies  depends  on  two  factors: 

1.  The  quality  of  the  value-to-go  approximation  - 
performance  of  the  policy  typically  improves  with 
approximation  quality. 

2.  The  length  of  the  lookahead  horizon  -  performance 
of  a  policy  typically  improves  as  the  horizon 
becomes  longer  (at  least  for  small  horizon  lengths, 
e.g.,  1-4). 

However,  as  the  size  of  the  lookahead  increases,  the 
number  of  possible  states  that  can  be  visited  increases 
exponentially. To keep the overall computation practical,
the complexity of the value-to-go approximation should
be  reduced  for  larger  lookahead  sizes.  Balancing  such 
tradeoffs  is  therefore  a  critical  element  in  determining  the 
size  of  the  lookahead  and  the  method  for  approximating 
the  value-to-go.  This  paper  explores  several  possibilities 
and  tries  to  quantify  the  associated  tradeoffs.  One  of  the 
advantages  of  using  limited  lookahead  policies  for  our 
particular  problem  is  that  the  number  of  controls  at  a 
particular  stage  is  fairly  small  and  as  a  result,  the 
computation  required  to  explore  all  states  that  can  be 
visited  over  the  next  M  stages  is  manageable  for  small  M. 
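
The M-stage lookahead selection described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the `(probability, reward, next_state)` transition format, and the toy model are all assumptions made for the example. Values are maximized, and the lookahead depth is capped by the number of remaining stages.

```python
def lookahead_value(state, depth, controls, transitions, value_to_go):
    """Best expected value over `depth` more decisions, then a terminal estimate."""
    options = controls(state)
    if depth <= 0 or not options:
        return value_to_go(state)
    return max(
        q_value(state, u, depth, controls, transitions, value_to_go)
        for u in options
    )


def q_value(state, u, depth, controls, transitions, value_to_go):
    """Expected reward of control u now plus the (depth - 1)-stage lookahead value."""
    return sum(
        prob * (reward + lookahead_value(nxt, depth - 1, controls,
                                         transitions, value_to_go))
        for prob, reward, nxt in transitions(state, u)
    )


def lookahead_control(state, M, remaining, controls, transitions, value_to_go):
    """Select a control with an M-stage lookahead, capped by the stages left."""
    depth = max(1, min(M, remaining))
    return max(
        controls(state),
        key=lambda u: q_value(state, u, depth, controls, transitions, value_to_go),
    )


# Hypothetical toy model: (state, control) -> list of (prob, reward, next_state).
TOY = {
    ("A", "left"): [(1.0, 1.0, "B")],
    ("A", "right"): [(1.0, 0.0, "C")],
    ("B", "go"): [(1.0, 0.0, "T")],
    ("C", "go"): [(0.5, 10.0, "T"), (0.5, 0.0, "T")],
}


def toy_controls(state):
    return [u for (s, u) in TOY if s == state]


def toy_transitions(state, u):
    return TOY[(state, u)]


def zero_value_to_go(state):
    # Value-to-go approximated with zero (the rolling-horizon special case).
    return 0.0
```

In this toy model a one-stage lookahead with a zero value-to-go prefers the immediate reward ("left"), while a two-stage lookahead sees the larger expected reward behind "right". A better value-to-go approximation could recover the same choice with a shorter horizon, which is the tradeoff discussed above.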

3.1.  Pruned  Limited  Lookahead  Policies 

Since  the  number  of  states  that  can  be  visited  over  M 
stages  grows  exponentially  in  M  and  also  in  the  number  of 
platforms,  limited  lookahead  policies  for  M>1  are 
impractical  for  problems  with  many  platforms.  One 
approach  to  reducing  the  computation  required  for  limited 
lookahead  policies  is  to  limit  the  number  of  states  that  can 
be  visited.  This  can  be  accomplished  by  “pruning” 
controls  that  yield  inferior  intermediate  values. 

A  pruned  version  of  a  limited  lookahead  policy 
depends  on  an  integer  parameter  B  that  is  typically 
selected  through  trial  and  error.  In  particular,  we 
determine  the  one-step  lookahead  values  for  all  controls 
available  from  our  initial  state.  Controls  that  are  not 
among  those  with  one  of  the  B  best  one-step  lookahead 
values  are  pruned.  We  then  repeat  this  process  for  each 
state  that  can  be  reached  from  a  control  that  was  not 
pruned  and  determine  the  one-step  lookahead  values  for 
all  controls  available  from  these  states.  For  each  of  these 
states,  controls  that  are  not  among  those  with  one  of  the  B 
best  one-step  lookahead  values  are  pruned.  The  number 
of  times  this  process  takes  place  is  equal  to  the  size  of  the 
lookahead. 

Since  the  number  of  controls  that  are  expanded  from 
every  state  at  every  stage  is  limited,  the  computation 
required  to  find  pruned  policies  is  not  exponential  in  the 
number  of  platforms.  However,  the  computation  is  still 
exponential  in  the  size  of  the  lookahead. 
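
The pruning procedure above can be sketched in the same style. The sketch assumes a hypothetical `(probability, reward, next_state)` transition format and a toy model; `B` is the pruning parameter from the text, and all function names are illustrative only.

```python
def one_step_value(state, u, transitions, value_to_go):
    """One-step lookahead value of control u: expected reward plus value-to-go."""
    return sum(p * (r + value_to_go(nxt)) for p, r, nxt in transitions(state, u))


def pruned_lookahead(state, depth, B, controls, transitions, value_to_go):
    """Return (value, control), expanding only the B best controls at each state."""
    options = controls(state)
    if depth <= 0 or not options:
        return value_to_go(state), None
    # Rank controls by their one-step lookahead values and prune all but B.
    ranked = sorted(
        options,
        key=lambda u: one_step_value(state, u, transitions, value_to_go),
        reverse=True,
    )
    best_value, best_control = float("-inf"), None
    for u in ranked[:B]:
        value = sum(
            p * (r + pruned_lookahead(nxt, depth - 1, B, controls,
                                      transitions, value_to_go)[0])
            for p, r, nxt in transitions(state, u)
        )
        if value > best_value:
            best_value, best_control = value, u
    return best_value, best_control


# Hypothetical toy model: (state, control) -> list of (prob, reward, next_state).
TOY = {
    ("A", "left"): [(1.0, 1.0, "B")],
    ("A", "right"): [(1.0, 0.0, "C")],
    ("B", "go"): [(1.0, 0.0, "T")],
    ("C", "go"): [(0.5, 10.0, "T"), (0.5, 0.0, "T")],
}


def toy_controls(state):
    return [u for (s, u) in TOY if s == state]


def toy_transitions(state, u):
    return TOY[(state, u)]


def zero_value_to_go(state):
    return 0.0
```

With `B=1` the control "right" is pruned at the first stage because its one-step value is lower, so the two-stage policy settles for "left". With `B=2` nothing is pruned and the fully expanded result is recovered. This illustrates why B is usually tuned by trial and error.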

4.  Platform  Decomposition 

We  now  present  an  approach  that  involves  exploiting 
the  structure  of  our  specific  problem  and  decomposing  it 
into  a  set  of  simpler  problems.  In  particular,  we 
decompose  the  problem  into  a  separate  sub-problem  for 
each  platform.  This  sub-problem  consists  of  determining 
the  optimal  sequence  of  nodes,  or  path,  to  visit  assuming 
that  platform  was  the  only  one  available.  The  optimal 
solution  to  each  sub-problem  can  be  found  analytically. 
After  a  sub-problem  is  solved  for  a  particular  platform  and 
before the next sub-problem is solved, the value of each
node in the associated path is updated to the value of the
node multiplied by the probability that the node is not
visited by the platform. This allows platforms to take into
account paths assigned to previously scheduled platforms.
When all of the sub-problems have been solved, a path
for each platform results. An outline of the platform
decomposition  approach  is  given  below. 

1.  Assume that the platforms are ordered 1,2,...,F,
and start with platform i = 1.

2.  Solve the single-platform problem optimally by
finding a path or sequence of nodes
$(n_{i1}, n_{i2}, \ldots, n_{iN})$ that the platform should attempt to
visit in order to maximize its expected value (which
consists of collected node values plus the reward
for the platform returning to the base station if
$n_{iN}$ is the base node).

3.  For every node in the path obtained in (2), scale the
value of the node by 1 minus the probability that the
node will be visited by platform i. This allows
platforms that are scheduled later to take into
account the path assigned to the current platform.

4.  If i is less than the number of platforms, then let
i = i+1 and go to (2). Otherwise, we are done.

The  single-platform  problem  in  step  2  can  be  solved 
using  dynamic  programming  or  by  exhaustively 
considering all possible paths with N nodes. The
computation required in either case is $O(D^N)$, where N is
the number of stages and D is the average degree of a
node.  For  sparsely  connected  graphs,  the  computation 
required  is  minimal. 

The  set  of  sub-problems  can  be  solved  once  for  a 
particular  ordering  of  platforms  or  multiple  times  for 
various  platform  orderings.  We  will  discuss  several 
possibilities  in  the  next  section. 
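
The outline above can be sketched as follows, using exhaustive path enumeration for the single-platform problem. This is only an illustration under stated assumptions: the graph encoding (`arcs[a][b]` is the probability of successfully traversing arc a to b), the function names, and the two-platform toy data are hypothetical, and each path is listed starting from the platform's current node.

```python
def evaluate_path(path, arcs, values, base, reward):
    """Expected value of one platform attempting `path` (path[0] is its current node).

    Returns (expected value, {node: probability that the platform visits it}).
    """
    reach, total, visit = 1.0, 0.0, {}
    for a, b in zip(path, path[1:]):
        reach *= arcs[a][b]
        if b not in visit:  # node values are one-time: only the first visit counts
            visit[b] = reach
            total += reach * values.get(b, 0.0)
    if path[-1] == base:
        total += reach * reward
    return total, visit


def all_paths(node, n_stages, arcs):
    """Enumerate every path of n_stages arcs starting at `node`."""
    if n_stages == 0:
        yield [node]
        return
    for nxt in arcs.get(node, {}):
        for rest in all_paths(nxt, n_stages - 1, arcs):
            yield [node] + rest


def platform_decomposition(starts, rewards, n_stages, arcs, values, base):
    """Plan platforms one at a time, discounting node values after each plan."""
    values = dict(values)  # work on a copy; the caller's values are left unchanged
    plan = []
    for start, reward in zip(starts, rewards):
        path = max(
            all_paths(start, n_stages, arcs),
            key=lambda p: evaluate_path(p, arcs, values, base, reward)[0],
        )
        _, visit = evaluate_path(path, arcs, values, base, reward)
        for node, prob in visit.items():
            # Step 3: scale the value by the probability the node is NOT visited.
            values[node] = values.get(node, 0.0) * (1.0 - prob)
        plan.append(path)
    return plan


# Hypothetical toy graph: base "b" and two target nodes "x" and "y".
ARCS = {"b": {"x": 0.9, "y": 0.8}, "x": {"b": 0.9}, "y": {"b": 0.8}}
VALUES = {"x": 10.0, "y": 9.0}
```

In the toy graph the first platform takes the more valuable node "x". Once the value of "x" has been scaled down, the second platform is steered to "y" instead of duplicating the first path.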

The platform decomposition heuristic yields for each
platform i a path $(n_{ij}, n_{i(j+1)}, \ldots, n_{iN})$, where j is the stage at
which the heuristic is applied. This heuristic can be
applied once before the mission begins to obtain a policy
in which platform i attempts to visit node $n_{ij}$ during the
jth stage if it has not yet been destroyed. The heuristic
can also be applied at every stage (for platforms that are
still alive) using up-to-date state information, obtaining a
policy in which platform i attempts to visit node $n_{ij}$
during the jth stage. Finally, the heuristic can also be used
to compute a value-to-go approximation for limited
lookahead policies.

One of the main advantages of the platform
decomposition approach is that the computation required
is considerably smaller than that of limited lookahead policies.
Assuming that the number of platform orderings
considered remains fixed, the computation grows linearly
in the number of platforms. In addition, as will be seen
below, the method obtains solutions that are very close to
the optimal. Unfortunately, while limited lookahead
policies  generalize  easily  to  other  problems,  other 
problems  may  not  have  structures  that  easily  decompose 
into  sub-problems. 

5.  Computational  Results 

We now present some computational results from
applying the above approaches to the problem described
in Section 2. We consider a problem with N=10 stages,
and  either  three  or  four  platforms.  The  return  rewards  for 
the  platforms  were  set  to  12.7,  17.5,  19.2,  and  55.0,  and 
the  most  valuable  platform  was  not  included  in  the  three- 
platform  problems. 

5.1.  Limited  Lookahead  Policies 


A  limited  lookahead  policy  consists  of  two  main 
elements:  the  lookahead  horizon,  and  the  approximation 
of  the  value-to-go.  We  vary  the  size  of  the  horizon  from 
one  to  three  and  consider  a  number  of  approximations  to 
the  value-to-go.  While  there  is  some  difference  in  the 
complexity  of  the  value-to-go  approximations,  each  one  is 
straightforward  to  compute. 

In many of our approaches, the value-to-go
approximation for a particular state $x$ after the first $k$
stages, $\tilde{J}_k(x)$, involves heuristically generating for each
platform $i$ a path or sequence of nodes
$(n_{i(k+1)}, n_{i(k+2)}, \ldots, n_{iN})$ to attempt to visit during the
remaining $N-k$ stages. We denote this collection of paths
$P(x,k)$. Assuming each platform attempts to visit the
nodes in its path, we can determine the expected collected
value resulting from visiting nodes not visited during the
first $k$ stages:

$$C[P(x,k)] = \sum_{\substack{\text{nodes } n \text{ not} \\ \text{yet visited}}} \Bigl[ 1 - \prod_{\text{platforms } i} (1 - p_{in}) \Bigr] c_n .$$

In the above equation, $c_n$ is the one-time value associated
with node $n$, and $p_{in}$ is the probability that platform $i$
visits node $n$:

$$p_{in} = \begin{cases} \prod_{j=k+1}^{l-1} p(n_{ij}, n_{i(j+1)}), & \text{if } n_{il} = n \text{ for some } l, \\ 0, & \text{otherwise,} \end{cases}$$

where $p(n_{ij}, n_{i(j+1)})$ is the probability of successfully
traversing the arc connecting nodes $n_{ij}$ and $n_{i(j+1)}$. To
understand the expression for $C[P(x,k)]$, note that the
term $\prod_{\text{platforms } i} (1 - p_{in})$ provides the probability that none of
the platforms successfully visits node $n$. The term
$\bigl[ 1 - \prod_{\text{platforms } i} (1 - p_{in}) \bigr] c_n$ then provides the expected collected
value at node $n$ (the probability that at least one platform
successfully visits the node multiplied by the node value).

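We can also determine the expected reward resulting
from platforms returning to the base node:

$$R[P(x,k)] = \sum_{\text{platforms } i} v_i \cdot \begin{cases} \prod_{j=k+1}^{N-1} p(n_{ij}, n_{i(j+1)}), & \text{if } n_{iN} \text{ is the base node,} \\ 0, & \text{otherwise,} \end{cases}$$

where the factor multiplying $v_i$
is the probability that platform $i$ returns to the base node
and $v_i$ is the platform return reward.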
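
The two quantities above, the expected collected node value and the expected platform return reward, can be computed for a given collection of paths as in the sketch below. The data layout is an assumption made for illustration: each path is a list of nodes that begins at the platform's current node, `arc_prob[(a, b)]` is the traversal success probability of arc a to b, and every arc listed in a path is multiplied into the product. All names and the toy numbers are hypothetical.

```python
from math import prod


def visit_probabilities(path, arc_prob):
    """Probability that a platform following `path` reaches each node on it."""
    reach, probs = 1.0, {}
    for a, b in zip(path, path[1:]):
        reach *= arc_prob[(a, b)]
        probs.setdefault(b, reach)  # keep the probability of the first arrival
    return probs


def expected_collected_value(paths, arc_prob, node_value, visited=()):
    """Sum over unvisited nodes of P(at least one platform visits) * node value."""
    per_platform = [visit_probabilities(p, arc_prob) for p in paths]
    total = 0.0
    for n, c_n in node_value.items():
        if n in visited:
            continue
        nobody = prod(1.0 - probs.get(n, 0.0) for probs in per_platform)
        total += (1.0 - nobody) * c_n
    return total


def expected_return_reward(paths, arc_prob, base, rewards):
    """Sum over platforms of P(platform completes a path ending at base) * reward."""
    total = 0.0
    for path, v_i in zip(paths, rewards):
        if path[-1] == base:
            survive = prod(arc_prob[(a, b)] for a, b in zip(path, path[1:]))
            total += survive * v_i
    return total


# Hypothetical toy data: two platforms both plan base -> x -> base.
ARC_PROB = {("b", "x"): 0.9, ("x", "b"): 0.9, ("b", "y"): 0.8, ("y", "b"): 0.8}
NODE_VALUE = {"x": 10.0, "y": 9.0}
PATHS = [["b", "x", "b"], ["b", "x", "b"]]
```

For the toy data, node "x" is missed by both platforms with probability 0.1 * 0.1, so its expected collected value is 0.99 * 10. Sending both platforms to the same node adds only 0.9 to the 9.0 a single platform would collect, which is the redundancy the product term captures.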

The approximations to the value-to-go that we
consider are given below. As can be seen in the
descriptions, many of the approximations involve a
combination of the expected collected node value,
$C[P(x,k)]$, and the expected platform return reward,
$R[P(x,k)]$, assuming each platform attempts to visit the
nodes in the paths specified in $P(x,k)$.


1.  The first approach approximates the value-to-go
with zero:

$\tilde{J}_k(x) = 0$.

2.  The  second  approach  approximates  the  value-to-go 
with  the  sum  of  the  expected  collected  node  value 
and  the  expected  platform  return  reward  collected 
over  a  set  of  greedy  paths: 

$\tilde{J}_k(x) = C[P_g(x,k)] + R[P_g(x,k)]$.

The nodes along the greedy path for platform $i$,
$(n_{i(k+1)}, \ldots, n_{iN})$, are determined as follows:

$n_{i(j+1)} = \arg\max_{n \in \eta(n_{ij})} \{ p(n_{ij}, n)\, c_n \}$,

where $\eta(n_{ij})$ is the set of nodes that can be reached
from node $n_{ij}$, and $n_{ik}$ is the node at which
platform $i$ is located after $k$ stages.

3.  The third approach approximates the value-to-go
with the expected platform return reward collected
over the set of “safest” paths:

$\tilde{J}_k(x) = R[P_s(x,k)]$.

The safest path is that which yields the highest
probability of a platform returning successfully to
its base node. These paths can be computed a priori
using dynamic programming. (Essentially, the
computation is equivalent to solving a set of
shortest path problems.)

4.  The fourth approach approximates the value-to-go
with the sum of the expected collected node value
and the expected platform return reward collected
over the set of safest paths:

$\tilde{J}_k(x) = C[P_s(x,k)] + R[P_s(x,k)]$.

5.  The fifth approach approximates the value-to-go
with the sum of the expected collected node value
and the expected platform return reward collected
over the set of “most valuable” paths:

$\tilde{J}_k(x) = C[P_v(x,k)] + R[P_v(x,k)]$.

The most valuable path is that which yields the
highest expected total value that could be attained
by a single vehicle during the remaining stages
assuming none of the values at any of the nodes
have yet been collected. These paths can also be
computed a priori using dynamic programming.

6.  The  sixth  approach  combines  (4)  and  (5).  The 
value-to-go  is  approximated  with  the  maximum  of 
the  values  determined  by  those  approaches. 
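
As an illustration of how the "safest" paths used in approaches (3) and (4) might be precomputed, the sketch below runs a backward dynamic programming recursion over the number of remaining stages. It is a sketch under assumptions, not the authors' code: `arcs[a][b]` is taken to be the traversal success probability, a path must end at the base node after exactly the given number of stages, and the names and toy graph are hypothetical. Taking negative logarithms of the arc probabilities would turn the same recursion into a shortest path computation.

```python
def safest_paths(arcs, base, n_stages):
    """Backward DP over remaining stages.

    prob[t][n] is the highest probability of being at `base` after exactly t
    more arc traversals starting from node n; nxt[t][n] is the best next node
    (None if the base cannot be reached in exactly t stages).
    """
    prob = [{n: (1.0 if n == base else 0.0) for n in arcs}]
    nxt = [{n: None for n in arcs}]
    for t in range(1, n_stages + 1):
        prob_t, nxt_t = {}, {}
        for n, successors in arcs.items():
            best_node, best_prob = None, 0.0
            for m, p in successors.items():
                candidate = p * prob[t - 1].get(m, 0.0)
                if candidate > best_prob:
                    best_node, best_prob = m, candidate
            prob_t[n], nxt_t[n] = best_prob, best_node
        prob.append(prob_t)
        nxt.append(nxt_t)
    return prob, nxt


def extract_path(start, stages, nxt):
    """Follow the stored decisions; returns None if no return path exists."""
    path = [start]
    for t in range(stages, 0, -1):
        step = nxt[t][path[-1]]
        if step is None:
            return None
        path.append(step)
    return path


# Hypothetical toy graph with base node "b".
ARCS = {
    "b": {"x": 0.9, "y": 0.8},
    "x": {"b": 0.9, "y": 0.5},
    "y": {"b": 0.8},
}
```

Because the tables depend only on the graph and the number of remaining stages, they can be filled in once before the mission and then looked up during the lookahead.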

Table  1  provides  the  expected  optimal  values  for  the 
problem  illustrated  in  Figure  1  for  a  three-platform 
problem  and  a  four-platform  problem.  We  have  computed 
these  values  using  dynamic  programming,  and  the 
computation  required  for  the  four-platform  problem  was 
approximately  one  week  on  a  Sun  Ultra  60  workstation. 
Table  1  also  provides  the  results  of  applying  a  greedy 
algorithm,  in  which  each  platform  selects  as  its  next  node 
that  which  maximizes  its  expected  collected  value  for  that 
stage,  to  one  thousand  sample  trajectories.  The 
performance  achieved  in  our  earlier  efforts  of  applying 
rollout  strategies  using  20  or  more  Monte  Carlo 
simulations  ranged  on  average  from  600  to  610  for  the 
four-platform  problem. 


Table 1  The expected optimal values and the
results of applying the greedy algorithm for the
three and four platform problems.

# Platforms    Expected Optimal    Greedy
Three          574.5               475.72
Four           641.0               533.89

Tables 2 and 3 provide the values averaged over one
thousand sample trajectories by applying the limited
lookahead policies for lookahead sizes of one to three,
using the six value-to-go approximations described above.
The particular approximation approach used is given in
the leftmost column. As can be seen, while the 2-stage
policies generally provided results that improved
significantly upon those of the 1-stage policies, those of
the 3-stage policies were not substantially better and in a
few cases were worse than those of the 2-stage policies.


The  sixth  value-to-go  approximation  seemed  to  yield 
slightly  better  results  than  the  other  approximations. 
However,  the  third  through  sixth  approximations  were 
basically  comparable.  Overall,  these  approaches 
improved  significantly  upon  the  greedy  algorithm  and 
were  able  to  obtain  values  close  to  the  optimal  for 
lookahead  sizes  greater  than  one.  For  lookahead  sizes 
greater  than  one,  these  approaches  were  also  able  to 
obtain  results  slightly  better  than  those  obtained  using 
rollout  strategies  with  Monte  Carlo  simulations. 


Table  2  The  results  of  applying  the  limited 
lookahead  policy  to  the  three-platform  problem. 

Value-to-go  1 -stage  2-stage  3-stage 
Approximation _ _ _ _ _ 

_ 1 _ 491.09  539.58  543.40 

2 


3 


1 -stage 

2-Stage 

491.09 

539.58 

520.57 

543.95 

506.55 

550.82 

500.69 

529.98 

554.10 

557.94 

555.97 

563.45 

553.76 


559.46 


Table 3  The results of applying the limited
lookahead policy to the four-platform problem.


1 -stage  2-stage 


3-stage 


543.32 


589.56 


581.48 


574.06 


582.84 


595.44 


574.24 


607.31 


32 

8.5 


594.09 


615.10 


Tables  4  and  5  provide  the  average  values  obtained 
over  the  same  thousand  sample  trajectories  by  applying 
the pruned limited lookahead policies for lookahead sizes
of  two  and  three,  using  the  value-to-go  approximations 
described  above.  (Note  that  a  pruned  one-step  lookahead 
policy  is  equivalent  to  the  fully  expanded  one-step 
lookahead  policy.)  As  can  be  seen,  the  results  of  these 
approaches  do  not  vary  significantly  from  the  fully 
expanded  lookahead  policies.  In  some  cases,  the  pruned 
policies  performed  one  or  two  percent  worse  and  in  other 
cases,  they  performed  one  or  two  percent  better. 


Table 4  The results of applying the pruned
limited lookahead policy to the three-platform
problem.

Value-to-go
Approximation    2-stage    3-stage
1                538.56     523.48
2                532.70     551.10
3                550.82     553.56
4                552.47     559.46
5                556.38     555.47
6                561.22     563.82

Table 5  The results of applying the pruned
limited lookahead policy to the four-platform
problem.


2-stage 

3 -stage 

573.19 

575.21 

605.57 

607.21 

608.98 

613.23 

595.55 

613.49 


616.21 

615.38 

592.50 

617.04 


5.2.  Platform  Decomposition  Results 

In  applying  platform  decomposition  to  our  problem, 
we  considered  the  following  approaches  to  ordering  the 
platforms: 

1.  A single ordering in ascending order of the
platform  return  reward. 

2.  All  possible  orderings. 

3.  A “rollout” of the ordering in (1) as described by
Bertsekas, Tsitsiklis and Wu [4]. That is, assuming
that the first i-1 platforms have been selected, the
ith platform is determined as follows:

i.  Consider  each  remaining  platform  in  turn  as  the 
next  platform  and  leave  the  other  vehicles  in 
their  original  order. 

ii.  Solve  the  set  of  single-platform  problems  in  the 
given  order. 

iii.  Select  as  the  ith  platform  that  which  yields  the 
best  result. 
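
The rollout of the ordering in (3) can be sketched generically. In the sketch below, `evaluate(order)` is a hypothetical stand-in for "solve the set of single-platform problems in the given order and return the resulting expected value"; the toy scoring function used to exercise it is not from the paper.

```python
def rollout_ordering(base_order, evaluate):
    """Build an ordering one platform at a time.

    At each step, every remaining platform is tried as the next one while the
    other remaining platforms keep their original relative order; the candidate
    whose completed ordering evaluates best is fixed.
    """
    chosen, remaining = [], list(base_order)
    while remaining:
        best = max(
            remaining,
            key=lambda c: evaluate(chosen + [c] + [p for p in remaining if p != c]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical stand-in score: earlier positions count more, weighted per platform.
WEIGHTS = {"a": 1.0, "b": 3.0, "c": 2.0}


def toy_evaluate(order):
    return sum(WEIGHTS[p] * (len(order) - i) for i, p in enumerate(order))
```

For F platforms this evaluates on the order of F squared orderings rather than all F! permutations, which is why it is a natural middle ground between a single ordering and all possible orderings.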

As  mentioned  in  Section  4,  there  are  several  ways  to 
apply  the  heuristic: 

•  The  heuristic  can  be  applied  once  to  obtain  a  policy 
for  all  stages. 

•  The  heuristic  can  be  applied  at  every  stage  to 
obtain  a  control  for  the  current  stage  using  current 
state  information. 

•  The  heuristic  can  be  used  to  generate  a  value-to-go 
approximation  for  a  limited  lookahead  policy. 


Table  6  provides  the  average  values  obtained  over  the 
same  thousand  sample  trajectories  by  the  platform 
decomposition  approach.  The  result  of  applying  the 
heuristic  for  all  possible  orderings  and  following  the  paths 
obtained  for  all  stages  is  provided  in  the  first  row.  The 
next  three  rows  provide  the  results  when  the  heuristic 
using  the  three  orderings  described  above  (least  expensive 
to  most  expensive,  all  possible  orderings,  and  a  rollout  of 
the  orderings)  is  reapplied  at  every  stage  to  obtain  the 
current control. The remaining rows provide the results
when the heuristic, under the orderings described above,
is used to provide a value-to-go approximation for a
one-stage limited lookahead policy. As can be
seen, these approaches performed extremely well. The
heuristic  alone  performed  comparably  to  2-stage 
lookahead  policies,  and  the  other  variations  were  able  to 
obtain  strategies  that  yielded  results  that  were  less  than 
one  percent  from  the  optimal  expected  results. 

Table 6  The results of applying platform
decomposition approaches. The first row
provides the result of applying the heuristic for
all possible platform orderings before the start of
the mission and following the resulting paths.
The next three rows provide the results of
reapplying the heuristic at every stage using
various platform orderings (1: least expensive to
most expensive; 2: all possible orderings; 3: a
rollout of the orderings). The last three rows
provide the results of applying one-stage limited
lookahead policies using the values obtained
from the platform decomposition heuristic
(under the various platform orderings) as an
approximation to the value-to-go.

Approach                    Three platforms    Four platforms
Heuristic alone             550.85             608.89
PD-Heuristic reapplied-1    568.81             634.97
PD-Heuristic reapplied-2    573.83             637.81
PD-Heuristic reapplied-3    573.83             637.81
PD-1-stage LL-1             570.97             633.04
PD-1-stage LL-2             571.29             [illegible]
PD-1-stage LL-3             571.29             635.65

5.3.  Computation Times

The following table provides the average on-line
computation time (in seconds) to apply the approaches
described above to one hundred sample trajectories of the
four-platform problem. The off-line computation time for
the limited lookahead policies was negligible. We have
measured the time required to compute the controls. In
practice, this time is critical since it must be within the
real-time constraints of the problem. The table gives the
total time to compute these controls for the ten stages.
Since these times depend on the state trajectory of the
system, which is random, we averaged over 100
trajectories and recorded the results in Table 7. The times
for the one-stage lookahead have not been included as the
time required was negligible. The experiments
were conducted on a Sun Ultra 60 workstation. As can be
seen from the table, the pruned lookahead policies were
significantly faster than the fully expanded lookahead
policies. Considering this in combination with the fact that
the performances of the two versions are comparable
suggests that pruned lookahead policies may be more
useful in practice. The pruned lookahead policies were
also generally much faster than the rollout algorithms
using Monte Carlo simulations, whose computation times
varied from 5 to over 300 seconds per sample trajectory.
The decomposition approaches were extremely fast, and
also provided the best results. Reapplying the
decomposition heuristic at every time step appears to be
the best option. However, it is not clear how easily such
approaches can be applied to variations of the problem.

Table 7  Time to compute the controls for ten
stages under the various approaches averaged
over 100 sample trajectories of the four-platform
problem. The first six lines provide the times
corresponding to the fully expanded and pruned
limited lookahead results given in Tables 3 and
5. The next six lines provide the times
corresponding to the last six platform
decomposition results given in Table 6.

                            2-stage lookahead         3-stage lookahead
Approach                    Full           Pruned     Full      Pruned
LL-1                        [illegible]    0.12       120.4     1.8
LL-2                        9.41           [...]04    1358      16.4
LL-3                        1.35           [...]22    134.7     3.4
LL-4                        6.16           0.71
LL-5                        9.16           0.91
LL-6                        14.85          1.73

Approach                    Time
PD-Heuristic reapplied-1    0.41
PD-Heuristic reapplied-2    9.42
PD-Heuristic reapplied-3    1.62
PD-1-stage LL-1             51.50
PD-1-stage LL-2             396.52
PD-1-stage LL-3             138.04

6.  Summary


In  this  paper,  we  have  considered  alternatives  to  using 
on-line  simulations  for  approximating  the  value-to-go  for 
adaptive  multi-platform  scheduling  in  a  risky 
environment.  The  main  limitation  to  using  rollout 
algorithms with on-line simulations that was determined in
our previous effort was the amount of computation
required  to  evaluate  control  options  at  every  stage.  We 
instead  considered  two  alternatives. 

The  first  approach  involved  examining  control  options 
over  a  limited  horizon.  In  our  experimental  results,  this 
method  produced  results  that  were  slightly  better  than 
those  obtained  through  rollout  algorithms  with  on-line 
simulations  with  similar  computation  time.  Computation 
time  was  reduced  significantly  by  introducing  a  pruning 
technique  without  loss  in  performance. 

The  second  approach  involved  decomposing  the 
problem  into  sub-problems  associated  with  each  platform. 
This  method  produced  results  that  were  extremely  close  to 
the  optimal  values  and  required  small  computation  times. 
However,  while  limited  lookahead  methods  generalize 
well  to  other  problems,  the  decomposition  method 
requires a suitable problem structure. Furthermore, this
method may not perform well even for problems with an
appropriate structure if the decomposed elements require
significant coordination.

Abstract

In  this  paper,  we  investigate  the  use  of  rollout 
algorithms  for  adaptive  multi-platform  scheduling  in  a 
risky  environment.  The  underlying  decision  problem  is 
motivated  by  several  Air  Force  applications:  data 
collection,  sensor  management,  and  air  operations 
planning.  These  problems  may  be  solved  optimally  with 
stochastic  dynamic  programming  (SDP),  but  have 
overwhelming  computational  requirements.  Rollout 
algorithms  reduce  computational  requirements  by  using 
on-line  learning  and  simulation  to  approximate  SDP  with 
a  base  heuristic.  While  they  do  not  aspire  to  optimal 
performance,  rollout  algorithms  typically  result  in  a 
consistent  and  substantial  improvement  over  the 
underlying heuristics. A multi-platform planning and
scheduling  problem  is  used  to  demonstrate  rollout 
performance. 

1  Introduction 

The  planning  and  execution  of  multiple  missions  in 
the  presence  of  risk  is  a  problem  which  arises  in  many 
important  military  contexts.  In  data  collection 
applications,  multiple  UAV  platforms  may  be  tasked  to 
interrogate  different  areas,  with  the  risk  of  platform 
destruction  as  each  platform  pursues  its  collection 
mission.  In  attack  air  operations,  multiple  platforms 
follow  risky  trajectories  to  attack  enemy  targets.  For 
both  applications,  sensors  and  communication 
equipment  can  provide  up-to-date  information 
concerning  individual  mission  and  platform  status,  and 
thus provide notification of platform losses. This
creates opportunities for retasking surviving platforms
in  order  to  best  achieve  mission  objectives. 

In  mathematical  terms,  the  above  class  of  problems 
can  be  viewed  as  a  sequential  decision  problem,  where 
each  decision  is  based  on  the  observation  of  certain 
discrete  events.  These  decisions  affect  the  evolution  of 
a  system  state  (mission),  which  is  also  influenced  by 
random  discrete  events  (e.g.  platform  destruction).  The 
goal  is  to  select  the  current  decisions  as  a  function  of 
the  current  system  state,  in  a  manner  that  optimizes 
mission  performance. 

The  above  class  of  problems  can  be  formulated  as 
Markov  decision  problems  [3],[5].  The  principal 
approach  for  solving  such  problems  is  dynamic 
programming  (DP),  which  selects  feedback  rules  to 
determine  optimal  controls  for  each  possible  state. 
These  optimal  controls  are  determined  by  evaluating  at 
each  stage  the  immediate  expected  cost  of  the  current 
decision,  plus  the  future  optimal  cost-to-go  over  future 
decisions.  However,  it  is  well  known  that  computation 
of  the  optimal  cost-to-go  for  each  future  state  is 
computationally  intractable  for  all  but  the  simplest  of 
problems,  making  direct  application  of  DP  an  impossible 
task for multi-platform control.

In  recent  years,  there  has  been  a  great  deal  of 
research  on  approximate  DP  methods  based  on 
computing  suitable  approximations  to  the  optimal  cost- 
to-go. These methods are collectively known as neuro-dynamic
programming (NDP) [1]. In NDP, the optimal
cost-to-go  is  approximated  by  a  parametric  function; 
critical  issues  for  NDP  include  the  selection  of  the 
parametric  class  of  approximating  functions,  and 
selection of the approximating parameters.

In this paper, we apply a particular class of NDP
algorithms,  known  as  rollout  algorithms  [2],  to  risky 
multi-platform  planning  and  scheduling  problems. 
Rollout  algorithms  are  a  form  of  NDP  which  exploits 
knowledge  of  suboptimal  heuristic  decision  rules  to 
obtain  approximations  to  the  optimal  cost-to-go  for  use 
in  NDP.  We  develop  different  rollout  algorithms  for 
risky  multi-platform  scheduling,  and  illustrate  the 
relative  performance  of  the  rollout  algorithms  and  the 
original  suboptimal  decision  rules  in  the  context  of  a 
specific  example.  The  results  illustrate  that  significant 
performance  improvements  can  be  obtained  using 
rollout  algorithms,  with  a  modest  increase  in 
computation  complexity. 

2  Illustrative  Overview 

To  illustrate  the  types  of  problems  of  interest  and 
results  developed  in  this  paper,  consider  the  data 
collection  problem  illustrated  in  Figure  1.  There  are 
several  data  collection  assets,  which  may  travel  to 
examine  targets.  There  is  a  value  associated  with 
collecting  the  information  on  each  target.  Platforms  also 
run  the  risk  of  destruction  while  performing  collection 
on a target, due to the presence of local defenses.


Figure  1  Illustration  of  Data  Collection  Problem 

Ideally,  each  data  collection  asset  will  be  provided  a 
schedule  of  targets  for  information  collection,  which  is 
coordinated  among  assets  to  ensure  maximal  value 
collected.  However,  due  to  the  risk  inherent  in  the 
collection  process,  platforms  can  be  destroyed,  and 
thus  the  original  schedules  should  be  adapted 
whenever  a  destruction  event  occurs  in  order  to  recover 
the  most  collection  value.  If  these  abrupt  events  are  not 
anticipated  in  the  original  schedules,  the  possible 
modifications  to  the  schedules  may  be  so  constrained 
that  highly  sub-optimal  performance  results. 

The  basic  theory  of  dynamic  programming  provides  a 
framework  for  developing  schedules  which  anticipate 
the  future  occurrence  of  contingencies  such  as  platform 
destruction,  and  hedge  the  selected  schedules  in 
anticipation of needed retasking. Thus, the resulting
schedules can be adapted to contingencies with minimal
performance  degradation,  resulting  in  robust,  stable 
control. 

The  computational  requirements  of  DP  depend  on 
the  number  of  future  states  required  to  describe  the 
system. To illustrate the number of states required,
assume that there are $N$ targets, $M$ collection assets, and
that we simplify physical position descriptions to
describe only the $N$ positions of the targets. Then, the
number of possible combinations of positions is $N^M$
and the number of possible uncollected target sets at a
given time is $2^N$, resulting in $N^M 2^N$ states.
For  modest  numbers  of  assets  and  targets,  the  number 
of  states  far  exceeds  our  capability  for  computing  and/or 
storing  the  resulting  optimal  decision  rules. 
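
To give a feel for this growth, the short calculation below multiplies the number of joint asset positions by the number of possible uncollected target sets. It assumes the simplified state description used in this paragraph (each of M assets sits at one of N target positions, and each target is either collected or not); the function name and the sample sizes are illustrative only.

```python
def dp_state_count(n_targets, n_assets):
    """Joint asset positions (N^M) times possible uncollected target sets (2^N)."""
    return n_targets ** n_assets * 2 ** n_targets


for n_targets, n_assets in [(10, 3), (20, 5), (30, 8)]:
    count = dp_state_count(n_targets, n_assets)
    print(f"N={n_targets}, M={n_assets}: {count:.3e} states")
```

Even the middle case, 20 targets and 5 assets, gives more than 10^12 states, each of which would need its own stored decision under exact DP.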

Using  NDP  principles  such  as  rollout  strategies 
greatly  reduces  the  resulting  computational  complexity. 
DP  considers  all  of  the  possible  states  and  computes  a 
tentative  decision  for  each  possible  state,  whereas  NDP 
only  computes  decisions  for  states  that  actually  occur 
in the scenario. Thus, the number of states considered
by NDP is much smaller, but can only be
determined  in  real-time.  In  the  rollout  methodology, 
once  the  scenario  reaches  a  given  state  where  a 
contingency  has  been  observed,  new  plan  options  are 
evaluated  in  real-time  to  select  the  future  actions.  The 
result  is  a  practical  algorithm  for  feedback  control  in 
complex  multi-platform  planning  and  scheduling 
applications.  The  fundamental  questions  about  this 
approach  are  how  good  is  the  performance  achieved, 
and  how  much  real  time  computation  is  required.  These 
questions  are  explored  in  greater  detail  in  the 
subsequent  sections. 

3  Rollout  Algorithms 

Consider a discrete-time version of a dynamic
decision problem,

$$x_{k+1} = f(x_k, u_k, \omega_k),$$

where $x_k$ is the state, $u_k$ is the control to be selected
from a finite set $U(x_k)$, and $\omega_k$ is a random
disturbance. Denote the single-stage cost of control $u$
from state $x$ and disturbance $\omega$ by $g(x,u,\omega)$.
A control policy $\pi = \{\mu_0, \mu_1, \ldots\}$ maps, for each
stage $k$, a state $x$ to a control value $\mu_k(x) \in U(x)$. In the
$N$-stage horizon problems considered herein, $k$ takes
values $0, 1, \ldots, N-1$, and there is also a terminal cost
$G(x_N)$ that depends on the terminal state $x_N$. The
cost-to-go of policy $\pi$ starting from a state $x_k$ at time $k$ can
be computed using the following DP recursion

$$J_k^{\pi}(x) = E\{ g(x, \mu_k(x), \omega) + J_{k+1}^{\pi}(f(x, \mu_k(x), \omega)) \} \qquad (1)$$

for all $k$ and with the initial condition

$$J_N^{\pi}(x) = G(x).$$

The rollout policy based on $\pi$ is denoted by
$\bar{\pi} = \{\bar{\mu}_0, \bar{\mu}_1, \ldots\}$, and is defined by the operation

$$\bar{\mu}_k(x) = \arg\min_{u \in U(x)} E\{ g(x,u,\omega) + J_{k+1}^{\pi}(f(x,u,\omega)) \} \qquad (2)$$

for all $x$ and $k$. Thus the rollout policy selects decisions
by balancing the current cost with future costs-to-go,
where the optimal costs-to-go are approximated by the
performance of the base policy $\pi$.

A straightforward approach for computing the rollout
control at a given state $x$ and time $k$ is to use Monte
Carlo simulations of the base policy. To implement this
approach, we consider all possible controls $u \in U(x)$
and generate a "large" number of simulation trajectories
of the system starting from $x$, using $u$ as the first
control, and using the policy $\pi$ thereafter. Thus the
simulated trajectory has the form

$$x_{i+1} = f(x_i, \mu_i(x_i), \omega_i), \quad i = k+1, \ldots, N-1,$$

where the first generated state is

$$x_{k+1} = f(x, u, \omega_k).$$

The costs corresponding to these trajectories are
averaged to obtain the Q-factor

$$Q_k(x, u) = E\{ g(x, u, \omega) + J_{k+1}^{\pi}(f(x, u, \omega)) \}.$$

In reality, only an approximation $\tilde{Q}_k(x, u)$ is obtained
because of the associated simulation error. The
approximation becomes increasingly accurate as the
number of simulation trajectories increases. Once the
approximate Q-factor $\tilde{Q}_k(x, u)$ corresponding to each
control $u \in U(x)$ is computed, we obtain the
approximate rollout control $\tilde{\mu}_k(x)$ by the minimization

$$\tilde{\mu}_k(x) = \arg\min_{u \in U(x)} \tilde{Q}_k(x, u).$$
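
The Monte Carlo procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function `rollout_control`, its arguments, and the toy problem (a platform closing a distance with a 0.9 success probability per move, evaluated against an idle base policy) are all hypothetical names and data introduced here.

```python
import random


def rollout_control(x, k, horizon, controls, step, cost, terminal_cost,
                    base_policy, num_sims, rng):
    """Estimate Q-factors by simulation and return the minimizing control."""
    q_est = {}
    for u in controls(x):
        total = 0.0
        for _ in range(num_sims):
            w = rng.random()                 # disturbance for the first stage
            traj_cost = cost(x, u, w)        # candidate control u is applied first
            xi = step(x, u, w)
            for i in range(k + 1, horizon):  # the base policy is used thereafter
                w = rng.random()
                ui = base_policy(xi, i)
                traj_cost += cost(xi, ui, w)
                xi = step(xi, ui, w)
            total += traj_cost + terminal_cost(xi)
        q_est[u] = total / num_sims          # sample average approximates Q(x, u)
    best = min(q_est, key=q_est.get)
    return best, q_est


# Hypothetical toy problem (not from the report): x is a remaining distance,
# a move of size u succeeds with probability 0.9, effort costs u per stage,
# and each unit of distance left at the end of the horizon costs 10.
def toy_controls(x):
    return range(0, min(x, 2) + 1)


def toy_step(x, u, w):
    return x - u if w < 0.9 else x


def toy_cost(x, u, w):
    return float(u)


def toy_terminal(x):
    return 10.0 * x


def idle_policy(x, k):
    return 0


best_u, q_values = rollout_control(
    x=2, k=0, horizon=3, controls=toy_controls, step=toy_step,
    cost=toy_cost, terminal_cost=toy_terminal, base_policy=idle_policy,
    num_sims=200, rng=random.Random(0))
```

In this toy case the idle base policy never moves, so the rollout step alone decides how much distance is closed; the estimated Q-factors make the trade-off between effort now and terminal cost later explicit.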

4  Example:  Data  Collection  Problem 

The  graph  in  Figure  2  corresponds  to  an  example 
data  collection  problem.  Each  node  represents  a 
geographical  area  of  interest  with  a  one-time  value  (i.e., 
data  may  only  be  collected  once  from  each  location). 
The  arcs  represent  connectivity  among  the  geographical 
regions  and  may  be  successfully  traversed  with  a 
known probability. Platforms traverse the graph and
collect data (value) at each node, or else they are
destroyed while traversing specific arcs. If a platform is
destroyed on an arc, the value of the destination node is
not collected, which can result in retasking other
platforms.


Figure  2  Graph  Representation  of  the  Data 
Collection  Problem 

The objective is to control the platforms in order to
maximize the expected total value collected after $N$
stages ($N = 10$ will be used). Each platform begins at a
base node (in this case, node 0 for all platforms) and
may traverse one arc during each stage. If a platform
does not return to its base node within $N$ stages, a
penalty corresponding to platform loss is incurred.

4.1  The  Base  Policy:  Greedy 

As a base policy for rollout, we use the greedy policy
$\pi = \{\mu_0, \mu_1, \ldots\}$, which is defined by the operation

$$\mu_k(x) = \arg\max_{u \in U(x)} E\{ g(x, u, \omega) \}$$

for all $x$ and $k$. The control $u$ is a vector of locations
corresponding to the next destination of each platform.
Similarly, each element of $\mu_k(x) = [\mu_k^0(x), \mu_k^1(x), \ldots]$
corresponds to a specific platform.

To reduce the computational overhead, we consider
the platforms sequentially. The control for the first
platform, $\mu_k^0(x)$, is selected independent of the other
platforms' controls as:

$$\mu_k^0(x) = \arg\max_{u \in U^0(x)} E\{ g(x, u, \omega) \}$$

where $U^0(x)$ are feasible controls for platform 0. The
control, $\mu_k^i(x)$, for subsequent platforms is conditioned
on all the previously selected controls
$\mu_k^0(x), \ldots, \mu_k^{i-1}(x)$ and is
defined by the operation

$$\mu_k^i(x) = \arg\max_{u \in U^i(x)} E\{ g(x, u, \omega) \mid \mu_k^0(x), \ldots, \mu_k^{i-1}(x) \}.$$

This allows the greedy policy to anticipate the arrival
of platforms at specific nodes based on previously
selected controls.
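
The sequential selection can be illustrated with a short Python sketch. It rests on an assumption that is not stated in the paper: conditioning on earlier controls is modeled by subtracting the expected value an earlier platform is likely to collect from the node it was sent to. The function `sequential_greedy` and the two-node data are hypothetical.

```python
def sequential_greedy(positions, neighbors, survive_prob, node_value):
    """Choose one destination per platform, in platform order."""
    remaining = dict(node_value)      # expected value not yet claimed at each node
    controls = []
    for pos in positions:
        # Expected one-stage gain of each feasible destination for this platform.
        gains = {n: survive_prob[(pos, n)] * remaining[n] for n in neighbors[pos]}
        best = max(gains, key=gains.get)
        controls.append(best)
        remaining[best] -= gains[best]  # later platforms see only what is left
    return controls


# Hypothetical data: two platforms at node 0, two candidate destinations.
neighbors = {0: [1, 2]}
survive_prob = {(0, 1): 0.9, (0, 2): 0.9}
node_value = {0: 0.0, 1: 10.0, 2: 8.0}
controls = sequential_greedy([0, 0], neighbors, survive_prob, node_value)
```

Selecting each platform independently would send both platforms to node 1; the sequential rule sends the second platform to node 2 because most of node 1's value is already expected to be collected.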

The greedy policy also forces platforms to return
within $N$ stages by constraining the set $U_k(x)$ of
feasible controls to those for which a return within $N$
stages is possible.
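
One way to implement such a return constraint is sketched below. It assumes an undirected graph in which every arc takes exactly one stage, so a breadth-first search from the base gives the minimum number of stages needed to return; the helper names `hops_to_base` and `feasible_controls` and the line graph are hypothetical.

```python
from collections import deque


def hops_to_base(neighbors, base):
    """Breadth-first search: minimum number of arcs from each node to the base."""
    dist = {base: 0}
    queue = deque([base])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def feasible_controls(node, k, horizon, neighbors, dist):
    """Destinations from which the base is still reachable by stage `horizon`."""
    moves_left_after = horizon - (k + 1)   # one arc is used by the move itself
    return [n for n in neighbors[node]
            if dist.get(n, float("inf")) <= moves_left_after]


# Hypothetical line graph 0 - 1 - 2 - 3 with the base at node 0.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = hops_to_base(neighbors, base=0)
early = feasible_controls(1, k=1, horizon=4, neighbors=neighbors, dist=dist)
late = feasible_controls(1, k=2, horizon=4, neighbors=neighbors, dist=dist)
```

With four stages in total, a platform at node 1 may still move outward at stage 1 but must turn back at stage 2.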

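The performance of the greedy policy corresponds to
the cost-to-go from the initial state $x_0$,

$$J_0^{\pi}(x_0) = E\{ G(x_N) + \sum_{i=0}^{N-1} g(x_i, \mu_i(x_i), \omega_i) \},$$

where the expectation is taken over simulation
trajectories of the form

$$x_{i+1} = f(x_i, \mu_i(x_i), \omega_i), \quad i = 0, \ldots, N-1.$$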
The  performance  of  the  greedy  policy  provides  a 
baseline  for  evaluating  the  rollout  policy. 

4.2  Rollout  Algorithm 

The rollout policy is computed using the greedy
policy as its base policy, as indicated in equations (1-2).
The performance of the rollout policy is evaluated in a
manner similar to the greedy policy, by using the
cost-to-go from the initial state $x_0$,

$$J_0^{\bar{\pi}}(x_0) = E\{ G(x_N) + \sum_{i=0}^{N-1} g(x_i, \bar{\mu}_i(x_i), \omega_i) \},$$

with the simulation trajectories

$$x_{i+1} = f(x_i, \bar{\mu}_i(x_i), \omega_i), \quad i = 0, \ldots, N-1.$$

To  reduce  the  relative  variance  of  performance 
values,  we  use  the  same  simulation  trajectories  in  the 
evaluations  of  all  policies. 
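
Reusing the same simulation trajectories across policies is the common random numbers technique. The sketch below is only an illustration of that idea: each "policy" is reduced to a per-stage success probability, which is a strong simplification, and the names `evaluate`, `shared`, `value_cautious`, and `value_risky` are hypothetical.

```python
import random


def evaluate(success_prob, disturbances):
    """Average number of successful arc traversals over the given trajectories."""
    total = 0
    for trajectory in disturbances:
        total += sum(1 for w in trajectory if w < success_prob)
    return total / len(disturbances)


rng = random.Random(0)
# 100 trajectories of 10 stage disturbances, drawn once and shared by all policies.
shared = [[rng.random() for _ in range(10)] for _ in range(100)]

value_cautious = evaluate(0.9, shared)   # hypothetical policy using safer arcs
value_risky = evaluate(0.6, shared)      # hypothetical policy using riskier arcs
difference = value_cautious - value_risky
```

Because both evaluations see identical disturbances, the difference between the two estimates reflects the policies rather than sampling noise; in this toy case the ordering holds on every shared trajectory.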

One  drawback  of  this  approach  is  that  many  on-line 
Monte  Carlo  simulations  may  be  required  to  compute 
the  rollout  decision  at  a  state.  As  an  alternative,  we  can 
use  approximations  trained  with  off-line  simulations,  as 
discussed  in  the  next  subsection. 

4.3  Rollouts  and  Neural  Approximations 

To  reduce  the  on-line  computational  overhead  of  the 
rollout  policies,  we  propose  to  train  off-line  a  parametric 
approximation  of  the  greedy  policy  performance  based 
on  features  which  characterize  the  current  state.  In 
particular,  the  features  that  we  use  correspond  to  the 
values achieved by the greedy policy under a small
number of certainty-equivalence scenarios, which
capture  the  graphical  dependence  of  the  scheduling 
problem.  This  approach  was  initially  proposed  in  [2]. 

To compute a feature at a given state $x_k$ at time $k$,
we fix the remaining disturbances at some nominal
values $\bar{\omega}_k, \ldots, \bar{\omega}_{N-1}$ and generate a state and
control trajectory of the system using the base policy $\pi$
starting from $x_k$ and time $k$. The corresponding cost is
denoted by $\tilde{J}_k(x_k)$, and is a feature which is used to
estimate the true cost $J_k^{\pi}(x_k)$. We use a small number
of disturbance trajectories corresponding to different
scenarios. The feature values computed for each of
these scenarios are combined parametrically to
approximate the cost of the base policy using the
functional form:

$$\tilde{J}_k(x_k, r) = r_0 + \sum_{m=1}^{M} r_m C_m(x_k) \qquad (3)$$

where $r = (r_0, r_1, \ldots, r_M)$ is a vector of parameters to be
determined, and $C_m(x_k)$ is the cost corresponding to
the $m$th scenario. The parameters $r$ are determined by an
off-line training process using simulations of the base
policy. Equation (3) can then be used on-line,
computing the costs $C_m(x_k)$ to evaluate the base
policy cost from state $x_k$ at time $k$.
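
A minimal sketch of the linear form in Equation (3) and of one possible off-line fit is given below. The paper does not specify the training method; ordinary least squares via the normal equations is an assumption made here, and the functions `approx_cost` and `fit_weights` as well as the training data are hypothetical.

```python
def approx_cost(features, r):
    """Evaluate r_0 + sum_m r_m * C_m for one state's scenario costs."""
    return r[0] + sum(rm * cm for rm, cm in zip(r[1:], features))


def fit_weights(feature_rows, targets):
    """Least-squares fit of r by solving the normal equations (A^T A) r = A^T y."""
    rows = [[1.0] + list(f) for f in feature_rows]   # leading 1 multiplies r_0
    n = len(rows[0])
    aug = [[sum(row[i] * row[j] for row in rows) for j in range(n)]
           + [sum(row[i] * y for row, y in zip(rows, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting (assumes A^T A is nonsingular).
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(n):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [a - factor * b for a, b in zip(aug[i], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical training set: two scenario costs per state, and targets that would
# in practice come from off-line simulations of the base policy.
feature_rows = [(10.0, 20.0), (30.0, 10.0), (20.0, 40.0), (50.0, 30.0)]
targets = [2.0 + 0.5 * c1 + 0.25 * c2 for c1, c2 in feature_rows]
r = fit_weights(feature_rows, targets)
estimate = approx_cost((40.0, 20.0), r)
```

Once the weights are fitted, the on-line work per candidate control reduces to evaluating the base policy on a handful of deterministic scenarios and taking this weighted sum.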


5  Experimental  Results 


A series of experiments was performed on the
example problem presented in Section 4, evaluating the
performance  of  the  base  greedy  policy  and  different 
variations  of  rollout  algorithms. 

The greedy heuristic used for the baseline policy is
based on an objective function with two terms, one
associated with the achievable value of data collected
and the other associated with the potential loss of the
vehicle. These values depend on probability ratios
associated with risk. The objective function for vehicle
$k$ at state $x_i$, $g_k(u_i, x_i, \omega)$, depends on
the achievable value $n_j(x)$ of option $j$ given
the current state, the value of vehicle $k$, and
the transition probability associated with option $j$ (which
characterizes the disturbance $\omega$). This objective
function is a risk neutral strategy that computes the
marginal difference between the largest acceptable loss
and the smallest acceptable gain associated with option
$j$.

The greedy heuristic is evaluated by determining the
cost-to-go from the initial state $x_0$,

$$J_0^{\pi}(x_0) = E\{ G(x_{10}) + \sum_{i=0}^{9} g(x_i, \mu_i(x_i), \omega_i) \},$$

where the expectation is approximated with 100 Monte
Carlo simulation trajectories of the form

$$x_{i+1} = f(x_i, \mu_i(x_i), \omega_i), \quad i = 0, \ldots, 9.$$

The evaluation of the greedy heuristic resulted in an
estimated value of 569.3 (the standard deviation of the
estimate is 12.2). As a benchmark, the total value
achievable is 714, arising from:

Total Collectible Value    610
Total Vehicle Value        104

We conducted three types of experiments, exploring
different rollout options. The first set of experiments
used different numbers of Monte Carlo runs in the rollout
algorithm to evaluate the relative performance of the
different controls for each state considered. Tested
conditions ranged from 5 to 40 Monte Carlo runs
per decision.

The  second  set  of  experiments  evaluated  alternatives 
in  the  planning  horizon  considered  in  the  rollout 
problem.  It  has  been  conjectured  [2]  that  the 
performance  of  rollout  strategies  degrades  after 
increasing  the  planning  horizon  beyond  a  threshold, 
due  to  the  approximation  of  the  optimal  future  policy  by 
a  base  policy.  This  approximation  becomes  less 
accurate with increasing planning horizon. To test this
conjecture, we conducted experiments where we varied the horizon
used by the rollout policy to evaluate the base policy.

The  final  set  of  experiments  compares  the 
performance  of  the  Monte  Carlo  rollout  algorithms  with 
the  performance  of  the  algorithms  based  on  parametric 
function  approximations  using  certainty  equivalence 
features. 

5.1  Variations  in  Monte  Carlo  Runs 

In these experiments, the number of Monte Carlo
runs used to evaluate the performance of the base
policy in the rollout algorithm is set to 5, 10, 20, 30,
or 40. For each of these
experiments, evaluation of the rollout policy
performance is conducted in a manner identical to the
evaluation of the greedy policy performance described
previously. That is, 100 independent Monte Carlo
simulation trajectories are used, of the form

$$x_{i+1} = f(x_i, \bar{\mu}_i(x_i), \omega_i), \quad i = 0, \ldots, 9,$$

where $\bar{\mu}_i(x_i)$ is the rollout control policy defined by the
operation

$$\bar{\mu}_k(x) = \arg\max_{u \in U(x)} E\{ g(x, u, \omega) + J_{k+1}^{\pi}(f(x, u, \omega)) \}$$

and this last expectation is also approximated with
5, 10, 20, 30, or 40 Monte Carlo simulations.

The results of this experiment are shown in Figure 3.
As the results indicate, the performance achieved by the
rollout strategies using 20 or more Monte Carlo
simulations ranges on average from 600 to 610, which
is far superior to the greedy policy performance
average of 569.3. The results suggest that only a modest
number of simulations is required to select good
controls in this example.

Figure 3 Rollout Performance

5.2  Variations  in  Planning  Horizon 

The  rollout  algorithm  performs  a  single  policy 
improvement  step  on  the  greedy  heuristic.  This  allows 
the  rollout  trajectory  to  deviate  significantly  from  the 
greedy  trajectory,  especially  over  long  horizons.  In  this 
case,  the  cost-to-go  estimate  derived  from  greedy 
trajectories  may  not  reflect  the  actual  cost-to-go  of  the 
rollout  algorithm.  One  way  of  avoiding  this  problem  is  to 
evaluate  the  future  cost-to-go  of  the  base  policy  over  a 
limited  horizon.  Figure  4  shows  the  performance  of  this 
rollout  algorithm  with  various  horizons,  where  20  Monte 
Carlo  experiments  are  used  to  evaluate  each  policy. 


Figure  4  Performance  as  a  Function  of  Horizon 

The  results  in  Figure  4  do  not  support  the  conclusion 
that  there  is  a  maximum  planning  horizon  beyond  which 
the  rollout  performance  degrades.  However,  the  results 
need  closer  examination  to  understand  whether  the  use 
of  a  different  base  heuristic  would  exhibit  similar 
behavior. 

5.3  Off-Line  Training  vs  Monte  Carlo 

In  these  experiments,  we  compare  the  performance  of 
rollout  algorithms  based  on  the  parametric 
approximations  of  Section  4.3  with  the  Monte  Carlo 
rollout  algorithms  of  Section  4.2.  The  parametric 
approximations  were  based  on  certainty  equivalent 
features,  which  corresponded  to  selecting  specific 
threshold  values  and  declaring  all  arcs  with  probabilities 
of  survival  greater  than  the  threshold  to  be  safe,  and  all 
arcs  with  probabilities  of  survival  less  than  or  equal  to 
the  threshold  to  have  a  certainty  of  destroying  any 
vehicles  on  those  arcs.  The  resulting  graph  is  a 
deterministic  graph,  which  leads  to  fast  evaluation  of 
the  base  policy.  The  performance  obtained  for  different 
values  of  thresholds  provided  the  base  features  for  the 
parametric  approximation. 
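
The thresholding step can be sketched as follows. This is an illustration under assumptions: a fixed route stands in for the base policy, the vehicle-loss penalty is ignored, and the functions `certainty_equivalent_arcs` and `route_value` and the three-arc graph are hypothetical. The thresholds 0.6 and 0.9 are used only as example values.

```python
def certainty_equivalent_arcs(survival_prob, threshold):
    """Declare an arc safe (True) if its survival probability exceeds the threshold."""
    return {arc: p > threshold for arc, p in survival_prob.items()}


def route_value(route, safe, node_value):
    """Value collected along a route on the resulting deterministic graph."""
    total, visited = 0.0, set()
    for a, b in zip(route, route[1:]):
        if not safe[(a, b)]:
            break                      # the platform is lost on an unsafe arc
        if b not in visited:
            total += node_value[b]     # each node's value is collected only once
            visited.add(b)
    return total


# Hypothetical data, for illustration only.
survival_prob = {(0, 1): 0.95, (1, 2): 0.8, (2, 0): 0.6}
node_value = {0: 0.0, 1: 10.0, 2: 20.0}
route = [0, 1, 2, 0]

features = [route_value(route, certainty_equivalent_arcs(survival_prob, t), node_value)
            for t in (0.6, 0.9)]
```

Each threshold yields one deterministic evaluation, so a feature costs a single pass over the route instead of many Monte Carlo trajectories.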

Several  rollout  algorithms  were  evaluated:  First,  we 
used  algorithms  based  on  single  features,  with  trivial 
parametric  approximation.  Second,  we  used  algorithms 
using  weighted  combinations  of  features,  with  weights 
trained  off-line  using  training  data.  Finally,  we  used 
optimized  weights,  searching  in  the  space  of  possible 
weights  for  optimal  performance;  this  is  not  a  practical 
algorithm,  but  provides  a  baseline  for  the  achievable 
performance  from  training  algorithms. 

The  experimental  results  are  summarized  in  Figure  5. 
These  experiments  used  only  two  features  in  the 
parametric  interpolation,  corresponding  to  two  different 
values  of  thresholds.  In  Figure  5,  four  pairs  of  features 
are considered along the x-axis. For each pair of
features, the rollout performance is evaluated with
optimal weights, and with trained weights. Rollout
performance is also evaluated for each of the features
used in isolation.

In  Figure  5,  the  upper  dashed  line  represents  the  best 
performance  achieved  using  the  Monte  Carlo  rollout 
approach,  and  the  lower  dashed  line  represents  the 
performance  of  the  greedy  heuristic.  The  dotted  line 
indicates  the  performance  using  two  statistically 
determined  features  combined  with  equal  weights. 

The  results  of  Figure  5  are  surprising,  in  that  the 
algorithms  based  on  off-line  training  seldom  approach 
the  performance  achieved  by  the  optimal  weighted 
combination.  Figure  5  shows  that  the  rollout  algorithm 
based  on  features  with  trained  weights  was  not  able  to 
offer  a  consistent  and  significant  improvement  over  the 
greedy heuristic. In some cases the combination of
features with trained weights was not able to perform
as well as the features individually.

Figure 5 Rollout Performance with Feature Pairs

Figure  5  shows  that  the  optimally  selected  weights 
using  the  features  0.6  and  0.9  achieved  an  overall 
performance  close  to  that  of  the  Monte  Carlo  approach. 
However,  when  other  pairs  of  features  were  used,  the 
performance  was  significantly  worse.  An  alternative  to 
training  or  optimization  is  to  select  the  parameters 
analytically.  The  dotted  line  in  Figure  5  shows  the 
performance  of  a  combination  of  features  that  were 
selected  with  equal  weights  to  “match”  the  statistical 
distribution  (mean  and  variance)  of  risk  within  the 
problem.  This  approach  provides  a  significant 
improvement  over  the  greedy  heuristic  without  the 
computational  cost  of  training  or  optimization.  This 
approach  appears  worthy  of  further  investigation  due  to 
its  simplicity. 

In  sum,  rollout  algorithms  using  parametric 
approximations  did  not  perform  as  well  as  rollout 
algorithms  using  Monte  Carlo  simulations. 


6  Conclusions 


In  this  paper  we  have  considered  the  use  of  rollout 
algorithms  for  adaptive  multi-platform  scheduling  in  a 
risky  environment.  We  explored  different  variations  of 
rollout  algorithms,  using  combinations  of  on-line  Monte 
Carlo  simulation  and  parametric  approximations.  Our 
experimental  results  show  that  rollout  algorithms  using 
on-line  simulation  perform  significantly  better  than  the 
reference  base  heuristic  policies,  using  only  a  modest 
number  of  Monte  Carlo  trajectories. 

In  our  experiments,  we  found  that  rollout  algorithms 
based  on  parametric  approximations  to  the  cost-to-go 
failed  to  achieve  the  level  of  performance  of  similar 
rollout  algorithms  using  on-line  Monte  Carlo 
simulations.  The  parametric  approximations  suffered 
from  two  limitations:  First,  the  training  techniques  often 
failed  to  identify  the  best  weight  combinations.  Second, 
the  parametric  approximations  were  unable  to  generalize 
accurately  across  the  broad  class  of  states  which 
occurred  in  the  problem.  Our  experiments  were  limited 
to  simple  classes  of  parametric  approximations  using  the 
concept  of  certainty  equivalence  scenarios.  Exploration 
of  alternative  approximations  using  different  features  is 
an  area  for  future  investigations. 

The  main  limitation  of  the  Monte  Carlo  rollout 
algorithms  is  the  amount  of  on-line  computation 
required  to  evaluate  the  different  options  at  each  state. 
We  are  currently  investigating  techniques  based  on 
discrete-event  systems  and  perturbation  analysis  [4]  to 
reduce  the  number  of  simulations  required  to  evaluate 
multiple  alternatives. 

Closed-Loop Control for Joint Air Operations

Jerry  M.  Wohletz,  David  A.  Castanon,  and  Michael  L.  Curry 


Abstract — This  paper  focuses  on  the  problem  of  providing 
real-time, closed-loop feedback control of Joint Air Operations
(JAO) via near-optimal mission assignments. For this
application, a rollout algorithm is employed which is based
on the theory of stochastic dynamic programming. The primary
benefits of this technology are agile and stable control
of distributed stochastic systems. The rollout algorithm is
applied to a small JAO scenario that includes limited assets,
risk/reward that is dependent on mission composition, basic
threat avoidance routing, and multiple targets, some of
which are fleeting and emerging. Simulation results illustrate
the benefits of the closed-loop feedback control. It is
shown that the rollout strategy provides statistically significant
performance improvements over an open-loop feedback
strategy that uses the same baseline heuristic. The performance
improvements are attributed to the fact that the rollout
algorithm was able to learn near-optimal behaviors that
were not modeled in the baseline heuristic.

Keywords — Large-scale control, approximate dynamic programming,
stochastic systems, adaptive control.

I.  Introduction 

CURRENTLY, air operations are executed according
to an Air Tasking Order (ATO) which is developed every
24 hours. If the end-to-end development time is included
(air operations planning, tasking, and execution),
the process takes 72 hours. Given the dynamic nature of
air operations, where things can change in a matter of minutes,
this open-loop control strategy suffers from a lack of
agility. Moreover, given our current information gathering
capability, our awareness of the battle space has never been
greater. The goal of this research is to take advantage of
the battle space information and develop closed-loop control
strategies to improve the agility of air operations.

Ideally, air packages are assembled and assigned to targets
such that their coordinated effect efficiently achieves
a campaign objective with the available resources. However,
due to the inherently uncertain JAO environment
(e.g., pop-up threats, time critical targets, and asset destruction),
the original tasking should be adapted whenever a
significant event occurs in order to achieve the campaign
objective. If these abrupt events are not anticipated, the
possible modifications may be so constrained that significant
performance degradation results.

In this paper, the JAO problem is viewed as a stochastic
control problem and an Approximated Stochastic Dynamic
Programming (ASDP) technique known as the Rollout Algorithm
(RA) is applied. This control strategy anticipates
future significant events, and hedges assets for the opportunity
of recourse. Thus, the resulting missions can be
adapted to contingencies with minimal performance degradation,
resulting in robust, stable control.

In  Section  II,  an  overview  of  the  control  methodology  is 
presented.  Simulation  results  for  a  small  JAO  scenario  are 
presented  in  Section  III.  Finally,  conclusions  are  presented 
in  Section  IV. 

II.  Methodology 

The JAO environment is an uncertain dynamical system
that has the following attributes: control decisions
made over time; probabilistic transition from one state to
the next, which is dependent on the choice of control; and
risk/rewards that are accumulated during each transition,
which are dependent on control and state transition outcome.
Thus, the tasking of air packages in a JAO environment
can be viewed as a sequential decision problem
where each decision is based on the observations of certain
discrete events.

This class of problems can be formulated as a Markov
decision problem [1]. The principal approach for solving
such problems is Stochastic Dynamic Programming (SDP).
Using the SDP formulation, an optimal control solution
is computed off-line, and on-line computation is reduced
to feedback rule evaluation or table lookup interpolation.
However, it is well known that this approach suffers from
the curse of dimensionality and is intractable for realistically
sized JAO problems.

A subtle but significant attribute of the SDP formulation
is that it explicitly models the feedback mechanism,
i.e. the dependence of future control decisions on future
information arrival, in the mathematical formulation. By
modeling the feedback mechanism in the mathematical formulation,
the SDP formulation generates proactive rather than
reactive solutions that are able to anticipate likely contingencies;
this trait produces guaranteed superior performance
for stochastic problems [2]. This proactive attribute
is imperative for stable and agile control of the JAO enterprise
because future information arrival and control opportunities
are dependent on stringent spatial, temporal, and
coordination constraints.

Note that in this paper we have adopted Dreyfus's [2] terminology
with respect to proactive versus reactive feedback control
solutions. Closed-Loop Feedback (CLF) will refer
to feedback control solutions that explicitly model the feedback
mechanism, and Open-Loop Feedback (OLF) will
refer to feedback control solutions that neglect the feedback
mechanism in the control optimization problem.

Given the well-known strengths and weaknesses of the
SDP formulation, there has been a great deal of research
on ASDP methods in recent years. These methods generally
maintain the SDP structure, but use a variety of
techniques to approximate the optimal cost-to-go. In this
paper, we apply one such technique known as the Rollout
Algorithm [3], [4] to the JAO problem. The rollout algorithm,
which has been used for a wide variety of dynamic
decision problems [4], [5], [6], [7], is a technique that exploits
knowledge of a suboptimal decision rule to obtain an
approximate cost-to-go for use in the SDP framework. Because
the RA maintains the SDP structure, it falls within
the category of closed-loop control; accordingly, the terminology
RA and closed-loop feedback will be used synonymously
in this paper.

An overview of the RA is presented below. Consider a
discrete event version of a dynamic decision problem,

$$x_{k+1} = f_k(x_k, u_k, w_k) \qquad (1)$$

where $x_k$ is the state taking values in some set, $u_k$ is
the control to be selected from a finite set $U_k(x_k)$, $w_k$ is a
random disturbance, and $f_k$ is a given function. We assume
that the disturbance $w_k$, $k = 0, 1, \ldots$ has a given distribution
that depends explicitly only on the current state and
control. Define a control policy, which is a sequence of feedback
functions that map each state $x_k$ to a control $u_k \in U_k$:

$$\pi_k = \{\mu_k(x_k), \mu_{k+1}(x_{k+1}), \ldots, \mu_{k+N-1}(x_{k+N-1})\} \qquad (2)$$

Thus, the control at time $k$ is $u_k = \mu_k(x_k) \in U_k(x_k)$. In
the $N$-stage horizon problems considered herein, the single-stage
cost function is denoted by $g_k(x_k, \mu_k(x_k), w_k)$ and the
terminal cost function is denoted by $G_{k+N}(x_{k+N})$. The
cost-to-go for policy $\pi$ starting from state $x_k$ at time $k$ can
be computed as follows:

$$J_k^{\pi}(x_k) = E\{ G_{k+N}(x_{k+N}) + \sum_{i=k}^{k+N-1} g_i(x_i, \mu_i(x_i), w_i) \} \qquad (3)$$

and can be represented in the SDP recursion format as
follows

$$J_k^{\pi}(x_k) = E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^{\pi}(f(x_k, \mu_k(x_k), w_k)) \} \qquad (4)$$

for all $k$ and with the initial condition

$$J_{k+N}^{\pi} = G_{k+N}(x_{k+N}) \qquad (5)$$

The $N$-stage SDP solution is as follows

$$\pi_k^{*} = \arg\min_{u_i \in U_i(x_i)} E\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}(f(x_k, \mu_k(x_k), w_k)) \} \qquad (6)$$

The RA exploits this formulation by replacing the control
mapping for times $k+1, \ldots, k+N-1$ with a predetermined
baseline heuristic $\bar{\mu}(x_i)$. Additionally, the RA is solved
forward in time, and is computed at the actual state $x_k$
rather than at all possible states at time $k$. Thus, the RA has the
following policy:

$$\pi_k^{ro} = \{u_k(x_k), \bar{\mu}(x_{k+1}), \ldots, \bar{\mu}(x_{k+N-1})\} \qquad (7)$$

Using this policy, the approximate optimal control solution
at time $k$ is

$$u_k^{ro} = \arg\min_{u_k \in U_k(x_k)} E\{ g_k(x_k, u_k, w_k) + J_{k+1}^{\bar{\mu}}(f(x_k, u_k, w_k)) \} \qquad (8)$$

Thus, the rollout policy is a one-step lookahead policy with
the optimal cost-to-go approximated by the cost-to-go of
the base policy. The RA computes the best control at the
current state $x_k$ at time $k$ by balancing the current cost
with an approximate cost-to-go using a baseline heuristic
to model future control decisions.

The computation of the cost-to-go of the base policy is
by no means trivial. When the number of states is very
large, the recursion may be infeasible. A straightforward
approach for computing the rollout control at a given state
$x_k$ and time $k$ is to use Monte Carlo simulations of the
baseline policy. To implement this approach, we consider
all possible controls $u_k \in U(x_k)$ and generate a "large"
number of simulation trajectories starting from $x_k$. Thus,
the simulated trajectory has the form:

$$x_{k+1} = f(x_k, u_k, w_k)$$
$$x_{i+1} = f(x_i, \bar{\mu}(x_i), w_i), \quad i = k+1, \ldots, k+N-1 \qquad (9)$$

The costs corresponding to these trajectories are averaged
to obtain the Q-factor

$$Q(x_k, u_i) = E\{ g_k(x_k, u_i, w_k) + J_{k+1}^{\bar{\mu}}(f(x_k, u_i, w_k)) \} \qquad (10)$$

Due to the finite number of simulations, only an estimate
$\tilde{Q}(x_k, u_i)$ of the Q-factor is obtained. The approximation
becomes increasingly accurate as the number of
simulation trajectories increases. Once the estimated
Q-factor $\tilde{Q}(x_k, u_i)$ corresponding to each candidate control
$u_i \in U(x_k)$ is computed, the optimal rollout control at
time $k$ for state $x_k$ is

$$u_k^{ro} = \arg\min_{u_k \in U_k(x_k)} \tilde{Q}(x_k, u_k) \qquad (11)$$

Thus, the RA starts with a baseline heuristic and improves
the policy by using on-line learning and simulation.
The algorithm is related to the SDP formulation and is
based on policy iteration ideas. As a result, the RA is not
as ambitious as the SDP, and only provides modest guarantees
of near-optimality [4]. It is an intermediate methodology
between heuristics and SDP.

III.  Simulation  Results 

This section presents implementation and simulation results
of the RA applied to a simplified JAO environment.
The purpose of this implementation is to highlight the benefits
of approximate optimal control for the JAO enterprise.
Thus, results will be presented for open-loop (OL), open-loop
feedback (OLF), and closed-loop feedback (CLF) controllers,
and both performance and behavioral characteristics
will be highlighted. All controllers generate mission orders
based on the current state of the JAO environment. A
mission order includes target assignment, coarse level routing,
air package composition, and desired time-on-target.

A. Demonstration Scenario

As a proof of concept, a small and simple scenario,
illustrated in Figure 1, is used to demonstrate the benefits
of CLF for the JAO enterprise. The scenario has a single
airbase, located to the left, that includes three strike and
three weasel aircraft. Enemy assets include two surface-to-air
missile sites (SAMs) and six targets, of which T2 is
fleeting and T4 is either fleeting (Scenario A) or time
critical (Scenario B).


The important attributes of this scenario include limited
assets, risk/reward dependent on package composition, basic
threat avoidance routing, and multiple targets, some of which
are fleeting and emerging. Asset attrition is highly likely
since SAMs occupy the airspace between the airbase and targets.
Also, given the number of targets and limited assets, multiple
strike packages and waves will be required to service the
targets. Finally, the performance in this scenario is governed
by the controllers' ability to manage attrition while servicing
the fleeting targets.

The state dynamics in (1) can be represented as a discrete
event system using a finite state, stochastic timed automaton
formulation [9]. As will be shown, the state dynamics can be
constructed through composition of individual asset automata.
Accordingly, the individual asset dynamics, i.e. aircraft,
threats, and targets, will be presented, followed by the
composition of these dynamics into a high-level, stochastic
timed automaton. To simplify this discussion, the notion of an
automaton for each asset will be illustrated via a state
transition diagram. The state transition diagram for an aircraft
is illustrated in Figure 2. Based on this diagram, the aircraft
state is defined as
$\mathcal{X}_{AC} = \{\text{Base}, \text{Ingress}, \text{Egress}, \text{Dead}\}$
and the significant event set is defined as
$\mathcal{E}_{AC} = \{\text{Launch}, \text{Threat Engage}, \text{Target Engage}, \text{Land}\}$.
As mentioned at the beginning of this section, the output of
the control optimization problem presented in Section II is
mission orders that consist of a list of air packages assigned
to targets. An air package $AP_i$ is defined as the product
composition of $n$ strike aircraft $AC_s$ and $m$ weasel aircraft
$AC_w$:

$$AP_i = AC_{s_1} \times \cdots \times AC_{s_n} \times AC_{w_1} \times \cdots \times AC_{w_m} \qquad (12)$$


Accordingly, the composite state is defined as
$\mathcal{X}_{AP_i} = \{\mathcal{X}_{AC_{s_1}}, \ldots, \mathcal{X}_{AC_{w_m}}\}$
and the event set is defined as
$\mathcal{E}_{AP_i} = \{\text{Launch}, \text{Threat Engage}, \text{Target Engage}, \text{Land}\}$.
The distinguishing feature between strike and weasel aircraft is
that strike aircraft destroy targets whereas weasel aircraft
destroy SAMs. A SAM may destroy both aircraft types. As
illustrated in the above diagram, all transitions are
deterministic with the exception of the threat engagement. For
the threat engagement event, the state transition
$p(x_{AP_{k+1}} \mid x_{AP_k}, x_{SAM_k}, \text{Threat Engage})$
is dependent on both the air package state $x_{AP}$ and the
threat state $x_{SAM}$ and is thus defined as an interaction event.
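
The aircraft automaton described above can be sketched as a small transition table. This is only an illustration under stated assumptions: the state and event names come from the text, but the specific transition map, the rule that threat engagement is feasible only while airborne, and the kill probability `p_kill` are hypothetical, since Figure 2 is not reproduced here.

```python
import random

# Assumed reading of the aircraft state transition diagram (Figure 2 is not
# available); the names are from the text, the mapping is hypothetical.
DETERMINISTIC = {
    ("Base", "Launch"): "Ingress",
    ("Ingress", "TargetEngage"): "Egress",
    ("Egress", "Land"): "Base",
}

def feasible_events(state):
    """Events that can occur in `state`, i.e. the set Gamma(x)."""
    events = {e for (s, e) in DETERMINISTIC if s == state}
    if state in ("Ingress", "Egress"):
        events.add("ThreatEngage")  # assumption: only airborne aircraft meet SAMs
    return events

def step(state, event, sam_active=True, p_kill=0.3, rng=random):
    """Next aircraft state; only ThreatEngage is stochastic."""
    if event not in feasible_events(state):
        raise ValueError(f"{event} is not feasible in {state}")
    if event == "ThreatEngage":
        # Interaction event: the outcome also depends on the SAM state.
        if sam_active and rng.random() < p_kill:
            return "Dead"
        return state
    return DETERMINISTIC[(state, event)]
```

Keeping the stochastic branch separate from the lookup table mirrors the statement that every transition except the threat engagement is deterministic.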

The state transition diagram for a target is illustrated in
Figure 3. Thus, the target state is
$\mathcal{X}_T = \{\text{Known}, \text{Unknown}, \text{Dead}\}$
and the significant events are
$\mathcal{E}_T = \{\text{Emerge}, \text{Hide}, \text{Target Engage}\}$.
For the scenario presented in Figure 1, there are four normal
targets, $T_1$, $T_3$, $T_5$, and $T_6$, that have the following
constraints: $p_0(\text{Known}) = 1$, and both {Hide} and
{Emerge} are triggering events w.p. 0. Target $T_2$ is a fleeting
target with the following constraints: $p_0(\text{Known}) = 1$
and {Emerge} is a triggering event w.p. 0. For Scenario A,
target $T_4$ is fleeting and has identical transitions to target
$T_2$. For Scenario B, target $T_4$ is a time critical target
with the following constraint: $p_0(\text{Unknown}) = 1$. For
all of these target types, the state can only transition to Dead
from the Known state, and this interaction event is governed by
the transition matrix
$p(x_{T_{k+1}} \mid x_{AP_k}, x_{T_k}, \text{Target Engage})$.

Fig. 3. Target State Transition Diagram

Finally, the state transition diagram for the SAM is
illustrated in Figure 4. Thus, the SAM state is
$\mathcal{X}_{SAM} = \{\text{Active}, \text{Inactive}, \text{Dead}\}$
and the significant events are
$\mathcal{E}_{SAM} = \{\text{Activate}, \text{Deactivate}, \text{Threat Engage}\}$.
It is important to note that the threat engagement event can
only occur if the SAM is Active, i.e. radar on, and this
transition is governed by the transition matrix
$p(x_{SAM_{k+1}} \mid x_{AP_k}, x_{SAM_k}, \text{Threat Engage})$.

Fig. 4. SAM State Transition Diagram

Thus, state dynamics (1) can be constructed through parallel
composition $\|$ of the individual asset automata:

$$x_{k+1} = AP_1 \| \cdots \| AP_n \| T_1 \| \cdots \| T_6 \| SAM_1 \| SAM_2 = \left(\mathcal{X}, \mathcal{E}, \Gamma(x), p(x_{k+1} \mid x_k, e_{k+1}), p_0(x_0), G_{i,j,k}\right) \qquad (13)$$

where $\mathcal{X}$ is the composite state space for each asset,
$\mathcal{E}$ is the composite event set, $\Gamma(x)$ is a
state dependent set of feasible events, i.e.
$\Gamma(x) \subseteq \mathcal{E}\ \forall x \in \mathcal{X}$,
$p(x_{k+1} \mid x_k, e_{k+1}) = P\{x_{k+1} \mid x_k, e_{k+1}\}$
is the state transition probability defined for all
$x_{k+1}, x_k \in \mathcal{X}$, $e_{k+1} \in \mathcal{E}$, and such that
$p(x_{k+1} \mid x_k, e_{k+1}) = 0\ \forall e_{k+1} \notin \Gamma(x_k)$,
$p_0(x) = P\{x_0 = x\}\ \forall x \in \mathcal{X}$, and
$G_{i,j,k}$ is the clock structure for event $i \in \mathcal{E}$,
air package $j \in \{AP_1, \ldots, AP_n\}$, and enemy asset
$k \in \{T_1, \ldots, T_6, SAM_1, SAM_2\}$. Note, the subscripts
i or j may be omitted if the interaction is irrelevant.

The definitions of the first five terms of the automaton in
(13) follow from the discussion in the previous paragraphs,
and all that remains is to define the stochastic clock
structure $G_{i,j,k}$ that defines the triggering event. To
simplify the composite clock structure, time $k$ has been
normalized into unit increments $T$, where each increment
corresponds to a transition from one waypoint to another in
Figure 1. Thus, it takes $2.5T$ to transition from the base to
a target; likewise, it takes $6T$ to perform a cycle: $5T$ for a
base-target-base transition and $T$ to turn aircraft around.
Another simplification is introduced by limiting the control
loop closures, i.e. forming and launching new air packages, to
time increments of $3wT$, where $w \in \mathbb{Z}_{\geq 0}$ is
the wave number.

Given these simplifications, the clock structure for a
given air package $AP_j$ launched at $k = 3wT$ is as follows:

$$\begin{array}{lcll}
G_{LA,AP_j} & = & \{3wT\} & \text{w.p. } 1 \\
G_{DE,AP_j,SAM_1} & = & \{(3w+1)T,\ (3w+5)T\} & \text{w.p. } 0.5 \\
G_{TR,AP_j,SAM_1} & = & \{(3w+1)T,\ (3w+5)T\} & \text{w.p. } 1 \\
G_{AC,AP_j,SAM_1} & = & \{(3w+2)T,\ (3w+6)T\} & \text{w.p. } 1 \\
G_{DE,AP_j,SAM_2} & = & \{(3w+2)T,\ (3w+4)T\} & \text{w.p. } 0.5 \\
G_{TR,AP_j,SAM_2} & = & \{(3w+2)T,\ (3w+4)T\} & \text{w.p. } 1 \\
G_{AC,AP_j,SAM_2} & = & \{(3w+2)T,\ (3w+4)T\} & \text{w.p. } 1 \\
G_{TA,AP_j,T_k} & = & \{(3w+3)T\} & \text{w.p. } 1 \\
G_{LN,AP_j} & = & \{(3w+6)T\} & \text{w.p. } 1
\end{array}$$

where $G_{LA,AP_j} = \{3wT\}$ w.p. 1 implies
$P\{AP_j \text{ Launch @ } k = 3wT\} = 1$. Additionally, the
clock sequences for targets $T_2$ and $T_4$, which are dependent
on clock time $cT$ where $c \in \mathbb{Z}_{\geq 0}$, are as follows:


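$$\begin{array}{lcl}
G_{HD,T_2} & = & \{10T\} \quad \text{w.p. } 1 \\
G_{HD,T_4} & = & \{[\text{illegible}]\} \quad \text{w.p. } 1
\end{array}$$

or

$$\begin{array}{lcl}
G_{EM,T_4} & : & P\{\text{Emerge} \leq cT\} = \sum_{i=1}^{c} 0.2\,(1 - 0.2)^{i-1} \\
G_{HD,T_4} & = & \{G_{EM,T_4} + [\text{illegible}]\,T\} \quad \text{w.p. } 1
\end{array}$$

where the distinction is made for Scenario A and B.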
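
The emergence clock for $T_4$ in Scenario B is a geometric law with per-step probability 0.2. The short sketch below evaluates that cumulative probability numerically; the summation limits (from 1 to `c`) are an assumption, since they are not legible in the source, and `emerge_cdf` is a hypothetical helper name.

```python
def emerge_cdf(c, p=0.2):
    """P{Emerge <= cT} for a geometric emergence time.

    Assumes the sum runs over i = 1..c with terms p * (1 - p)**(i - 1);
    under that assumption it equals the closed form 1 - (1 - p)**c.
    """
    return sum(p * (1 - p) ** (i - 1) for i in range(1, c + 1))


if __name__ == "__main__":
    for c in range(0, 7):
        print(c, round(emerge_cdf(c), 4))
```

Under this assumed indexing, the target has emerged with probability of roughly one half after three to four time increments.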

Due to this time synchronization and the fact that multiple
air packages $AP_1 \| \cdots \| AP_n$ may be in a given wave,
multiple events will occur at the same time. As a result, a
priority rule is imposed such that all feasible non-interaction
events, i.e. {Launch, Land, Emerge, Hide, Activate, Deactivate},
occur first, followed by feasible interaction events, i.e.
{Threat Engage, Target Engage}. In the case of multiple air
packages engaging the same SAM, the engagements are treated
separately and the triggering events are executed in order from
lowest to highest package number.
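
The priority rule for simultaneous events can be sketched as a stable sort. This is a minimal illustration, assuming each event is represented as a hypothetical `(event_name, package_number)` tuple; the representation and the function name are not from the source.

```python
NON_INTERACTION = {"Launch", "Land", "Emerge", "Hide", "Activate", "Deactivate"}
INTERACTION = {"ThreatEngage", "TargetEngage"}

def order_simultaneous(events):
    """Order events that are scheduled for the same time step.

    events: list of (event_name, package_number) tuples. Non-interaction
    events keep their given order and come first; interaction events follow,
    sorted from lowest to highest package number.
    """
    def key(item):
        name, package = item
        if name in NON_INTERACTION:
            return (0, 0)
        if name in INTERACTION:
            return (1, package)
        raise ValueError(f"unknown event: {name}")
    return sorted(events, key=key)  # sorted() is stable
```

Because Python's sort is stable, non-interaction events are not reordered among themselves, which matches the text: it only prescribes an order for competing engagements.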

B.  Controller  Implementation 

As noted in the previous section, control loop closures, i.e.
the optimization problem solved to determine the formation and
launching of new air packages, occur on time intervals of $3wT$,
where $w \in \mathbb{Z}_{\geq 0}$ is the wave number. For this
scenario, the objective of the control problem is to service as
many targets as possible while preserving our own force. To
quantify this objective, a valuation scheme is used. Targets
are valued, as depicted in Figure 1, from 25 to 150, strike
aircraft at 20, weasel aircraft at 40, and no values are
assigned to the SAMs. Thus, the objective function may be
represented as follows:

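$$J = E\Big\{ \sum_{i=1}^{6} J_{T_i}\, \mathbf{1}\big(x_{T_i}(N) = \text{Dead}\big)
 + \sum_{i=1}^{3} -20\ \mathbf{1}\big(x_{AC_{s_i}}(N) = \text{Dead}\big)
 + \sum_{i=1}^{3} -40\ \mathbf{1}\big(x_{AC_{w_i}}(N) = \text{Dead}\big) \Big\} \qquad (14)$$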

where $\mathbf{1}(x_{AC_{s_i}}(N) = \text{Dead})$ is an indicator
function that equals 1 if $x_{AC_{s_i}}(N) = \text{Dead}$ and 0
otherwise, and $N = \infty$ is the planning horizon used for
this implementation. As noted in Section II, the Q-factor
estimate $Q(x_k, u_i)$ is substituted for (14), and the optimal
rollout control $u_k \in U(x_k)$ is determined from (11). For
this implementation, 10 simulation trials were used to compute
$Q(x_k, u_i)$. Also, the admissible control set $U(x_k)$ can be
summarized as follows: aircraft at base can be sent out on
missions to specific targets, for which the minimum risk route
is determined a priori, or they can stay at base in reserve.
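
The valuation scheme and the simulation-based Q-factor estimate can be sketched as follows. This is an illustrative sketch only: `simulate` is a hypothetical callable standing in for one simulation trial that applies a candidate control and then follows the baseline heuristic, and choosing the control with the largest estimate is an assumption about the form of (11), which is not shown in this section.

```python
STRIKE_VALUE = 20   # value of a strike aircraft, from the text
WEASEL_VALUE = 40   # value of a weasel aircraft, from the text

def terminal_value(targets, strikes, weasels, target_values):
    """Realized value of one simulated outcome, in the spirit of (14).

    targets, strikes, weasels: dicts mapping an asset name to its final state.
    target_values: dict mapping a target name to its value.
    """
    reward = sum(target_values[t] for t, s in targets.items() if s == "Dead")
    loss = STRIKE_VALUE * sum(s == "Dead" for s in strikes.values())
    loss += WEASEL_VALUE * sum(s == "Dead" for s in weasels.values())
    return reward - loss

def estimate_q(simulate, state, control, trials=10):
    """Monte Carlo estimate of Q(x_k, u_i): the mean value over `trials` runs."""
    return sum(simulate(state, control) for _ in range(trials)) / trials

def rollout_control(simulate, state, controls, trials=10):
    """Pick the admissible control with the largest estimated Q-factor."""
    return max(controls, key=lambda u: estimate_q(simulate, state, u, trials))
```

With only 10 trials per candidate, the estimates are noisy; the trial count is exposed as a parameter so that the trade-off between run time and variance can be explored.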

As illustrated in Section II, the RA uses a baseline heuristic
$p(x_i)$ to model future decisions. For this implementation, a
generic greedy planning heuristic that generates a complete set
of mission orders is used. The heuristic is greedy because for
each target, the best mission package is selected without
consideration of the available assets and the mission priority,
which is proportional to the value of the target. The details
of this algorithm are as follows: for each target $T_i$, the
best mission package, consisting of $n^*$ strike and $m^*$
weasel aircraft, is determined via the following maximization:

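$$(n^*, m^*) = \arg\max_{n, m \leq 2}
\frac{J_{T_i}\, P\{x_{T_i}(k) = \text{Dead}\}
 - 20 \sum_{j=1}^{n} P\{x_{AC_{s_j}}(k) = \text{Dead}\}
 - 40 \sum_{j=1}^{m} P\{x_{AC_{w_j}}(k) = \text{Dead}\}}
{20n + 40m} \qquad (15)$$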
The probabilities are computed by propagating a Markov chain
from time $k = 3wT$ to $k = 3wT + 6T$ using the dynamics in (13)
with only a single air package with $n$ strike and $m$ weasel
aircraft. Note, $k = 3wT + 6T$ corresponds to the end of the
planning horizon, which is a single wave launched at $3wT$.

A few properties of the greedy heuristic are noteworthy.
First, as a simplification, the target's window of
vulnerability is not modeled in the Markov chain. Coupled with
the 1-wave lookahead, this means the heuristic is not capable
of recognizing the importance of aggressively prosecuting
fleeting and time critical targets. Additionally, resource
constraints are not modeled. As a result, the greedy
algorithm's output is a complete list of mission orders and
should be viewed as a launch queue where only the highest
priority missions are launched given the available resources.
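
The launch-queue interpretation can be sketched as a single pass over the mission orders in priority order. This is a minimal sketch: the tuple layout is hypothetical, and skipping a mission that cannot be resourced in order to try the next one is an assumption, since the text does not say how such missions are handled.

```python
def launch_from_queue(missions, strike_available, weasel_available):
    """Launch missions in priority order until base assets run out.

    missions: list of (target, target_value, n_strike, m_weasel) tuples.
    Priority is taken to be the target value, as stated in the text.
    Returns the list of targets whose missions are launched.
    """
    launched = []
    for target, _value, n, m in sorted(missions, key=lambda x: -x[1]):
        if n <= strike_available and m <= weasel_available:
            launched.append(target)
            strike_available -= n
            weasel_available -= m
    return launched
```

For example, with three strike and three weasel aircraft at base, a [2,2] package for the highest-valued target leaves only one aircraft of each type for lower-priority missions.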

C.  Simulation  Results 

In this section, RA simulation results are presented for both
versions of the JAO scenario described above. As a basis of
comparison, these results will be presented relative to a
loose Stochastic Upper Bound (SUB) and relative to both OL and
OLF controller implementations. For both versions of the JAO
scenario, a SUB is computed optimistically by determining the
expected value $E\{J(x_\infty)\}$ in (14) while assuming no
threats, and is thus a loose bound.

The comparison of the RA with OL and OLF will quantify the
performance benefits of feedback control over OL and the
performance benefit of proactive versus reactive control. For
the OL and OLF controller implementations, the greedy heuristic
described in Section III-B is used to generate mission orders.
In the OL implementation, a list of mission orders for each
target is determined based on the state of the environment at
$k = 0$, and this plan is not updated. In the OLF
implementation, the mission orders are updated for each loop
closure time $k = 3wT$ for $w \in \mathbb{Z}_{\geq 0}$.

C.1 Scenario A Results

Figure 5 shows the Scenario A performance results of the OL,
OLF, and CLF strategies. These results show that the RA
outperforms the OLF strategy, which attains a statistically
significant improvement over the corresponding OL strategy. It
is seen that the optimal control framework was able to achieve
86% of a loose SUB whereas the OLF was only able to achieve 67%
of the bound, and the OL controller was only able to achieve
45% of the bound. The performance improvement is attributed to
the fact that the RA develops strategies that are not inherent
in the baseline heuristic. For Scenario A, these strategies
include staging packages and opening attack corridors to manage
asset attrition, aggressively prosecuting fleeting targets, and
reserving assets for likely contingencies. Thus, a simple,
generic heuristic used in the rollout framework does generate
near-optimal behaviors and results.

As an example, at $k = 0$ when all targets and SAMs are alive,
the RA sends an $AP_1[0,2] \to T_4$ followed by an
$AP_2[2,1] \to T_4$ and leaves the remaining strike at base.
Note, $AP_i[n,m] \to T_j$ is shorthand for $AP_i$ with $n$
strike and $m$ weasels assigned to $T_j$. For this same
situation, the greedy heuristic recommends sending an
$AP_1[2,2] \to T_2$ instead of $T_4$ and leaving the remaining
strike and weasel aircraft at base since alternative missions
appear too risky. This decision encapsulates many of the
behavioral characteristics that are learned by rolling out
candidate controls. The RA sends the $AP_1[0,2] \to T_4$ out
first to manage attrition by opening an attack corridor. This
mission is followed by an $AP_2[2,1] \to T_4$, which is time
critical; a weasel aircraft is included in this package in the
event that a SAM is alive. Furthermore, by modeling the next
loop closure at $k = 3T$, a strike aircraft is held in reserve
for the potential opportunity to launch at $T_2$ in the event
that both SAMs are destroyed. This provides a maximum of two
strike opportunities on this high valued target. Thus, the RA
learns the concept of time and managing attrition without
coaching, since neither of these traits is modeled in the
baseline heuristic. Granted, a more sophisticated baseline
heuristic could have been used; however, by customizing the
baseline heuristic, generality would be lost, which is clearly
undesirable for applicability to other JAO scenarios. But the
point must be stressed that by modeling the feedback mechanism,
the RA was able to position assets for opportunities of
recourse, and this resulted in better performance over the OLF
implementation.

Fig. 5. Open-Loop, Open-Loop Feedback, and Closed-Loop Feedback
Results for Scenario A (86 Monte-Carlo)

C.2  Scenario  B  Results 

Figure 6 shows the Scenario B performance results of the OL,
OLF, and CLF strategies. Again, these results show that the RA
outperforms the OLF strategy, which attains a statistically
significant performance improvement over the corresponding OL
strategy. It is seen that the optimal control framework was
able to achieve 81% of a loose SUB whereas the OLF was only
able to achieve 73% of the bound and the OL controller was only
able to achieve 52% of the bound. Like Scenario A, the
performance improvement is attributed to the fact that the RA
develops strategies that are not inherent in the baseline
heuristic. For Scenario B, these strategies include developing
a Combat Air Patrol (CAP) over the emerging target region,
staging packages and opening attack corridors to manage asset
attrition, aggressively prosecuting fleeting targets, and
reserving assets for likely contingencies.

Of these behaviors, the CAP is the most interesting. Because
of the uncertain value function for $T_4$, the RA develops a
policy of continuously maintaining an ingress mission to $T_4$
as soon as possible. Note, this simulation environment does not
have loiter capability. At $k = 0$ when all targets and SAMs
are alive, the RA sends an $AP_1[0,1] \to T_2$ followed by an
$AP_2[2,2] \to T_2$ and leaves the remaining strike at base.
The greedy heuristic's recommendation is the same as it was for
Scenario A. The RA sends the $AP_1$ out first to manage
attrition by opening an attack corridor. The mission is then
followed by an $AP_2$ to aggressively prosecute $T_2$ so that it
can free up resources to establish the CAP. Given that $T_4$
emerges on average at $\approx 4T$, the reserve aircraft will
be in position to launch at the next loop closure $k = 3T$.
Likewise, when the first wave returns to base, these aircraft
will immediately be turned around to continue the CAP. This
process will continue until the target emerges and expires. The
remaining normal targets are then serviced.

Fig. 6. Open-Loop, Open-Loop Feedback, and Closed-Loop Feedback
Results for Scenario B (200 Monte-Carlo)

Viewing the results presented in Figure 5 and Figure 6, it
appears that OL and OLF performance increases for Scenario B.
It is true that there is a performance percentage increase;
however, this is a result of the invariance of the OL and OLF
results and a decrease in the SUB between the two versions. The
OL and OLF results are statistically equivalent because neither
control strategy services $T_4$ regardless of whether the value
function is deterministic or stochastic. On the other hand, the
decrease in the SUB and the resulting reduction in rollout
performance reflect the difficulty of Scenario B. Because of
the requirement to maintain an ingress mission to $T_4$ in
anticipation of the target emerging, fewer aircraft, on
average, are able to service the high-value, fleeting target
$T_2$.

IV.  Conclusions 

This paper focuses on the problem of providing a military
commander with real-time, closed-loop feedback control of Joint
Air Operations (JAO) via near-optimal mission assignments,
which anticipate possible mission modifications due to
uncertain future events. Based on the theory of stochastic
dynamic programming, an approximate optimal control strategy
known as the rollout algorithm is presented in this paper. In
this framework, the feedback mechanism, i.e. the dependence of
future control decisions on future information arrival, is
explicitly modeled in the control optimization problem; as a
result, future significant events are anticipated and assets
are hedged for opportunities of recourse. Thus, the resulting
missions can be adapted to contingencies with minimal
performance degradation, resulting in robust, stable control.
The rollout algorithm is applied to a small JAO scenario that
includes limited assets, risk/reward that is dependent on
package composition, basic threat avoidance routing, and
multiple targets, some of which are fleeting and emerging.
Simulation results illustrate the benefits of the approximate
optimal control strategy. It is shown that the rollout
algorithm provides statistically significant performance
improvements over an open-loop feedback strategy that uses the
same baseline heuristic. The performance improvements are
attributed to the fact that the rollout algorithm is able to
learn near-optimal behaviors (establishing combat air patrol
over time critical areas, staging packages and opening attack
corridors to manage friendly asset attrition, aggressively
prosecuting fleeting targets, and reserving assets for
contingencies) that are not modeled in the baseline heuristic.
ABSTRACT

A  key  component  of  a  Joint  Air  Operation  (JAO)  environment  is  the  planning  and  dynamic  control  of  missions 
in  the  presence  of  uncertainties.  This  involves  the  assignment  of  resources  (e.g.,  different  aircraft  types)  to  targets 
while  taking  into  account  and  anticipating  the  effect  of  random  future  events  and,  subsequently,  dynamic  control  in 
response  to  various  controllable  and  uncontrollable  events  as  missions  are  executed  in  a  hostile  and  rapidly  changing 
setting.  The  objective  is  to  maximize  the  reward  associated  with  targets  while  minimizing  loss  of  resources.  In  this 
paper,  we  first  formulate  the  problem  of  optimal  mission  assignment  and  identify  the  complexities  involved  due  to 
combinatorial  and  stochastic  characteristics.  We  then  describe  a  discrete  event  simulation  tool  developed  to  model 
the  JAO  environment  and  all  of  its  dynamics  and  stochastic  elements  and  to  provide  a  testbed  for  several  methods 
we  are  developing  to  solve  the  problem  of  agile  mission  control.  We  describe  some  of  these  methods,  including 
approximate  dynamic  programming  using  rollout  algorithms  and  optimal  resource  allocation  schemes,  and  present 
some  numerical  results. 

Keywords:  Discrete  Event  System,  Stochastic  Dynamic  Programming,  Simulation,  Mission  Planning. 

1.  INTRODUCTION 

The  Joint  Air  Operations  (JAO)  environment  may  be  viewed  as  a  stochastic  dynamic  system  in  which  “entities” 
such  as  aircraft,  threats  (e.g.,  hostile  air  defenses),  and  targets  interact,  and  a  variety  of  events  take  place,  some  of 
which  are  controlled  (e.g.,  the  decision  of  an  aircraft  to  engage  a  target)  and  some  are  random  (e.g.,  an  aircraft  being 
destroyed  by  a  threat).  Traditionally,  operations  are  carried  out  based  on  a  predefined  plan,  typically  created  as  an 
Air  Tasking  Order  (ATO)  about  24  hours  in  advance  of  an  actual  mission.  This  approach,  however,  lacks  “agility”, 
since  it  cannot  anticipate  changes  in  the  battle  space  nor  take  advantage  of  ever-increasing  sensor  capabilities  that 
can  provide  additional  information  on  a  continuous  basis.  Therefore,  a  critical  need  is  to  develop  dynamic  control 
mechanisms  that  not  only  incorporate  anticipative  capabilities  regarding  future  uncertain  events,  but  are  also  able 
to  swiftly  react  to  observed  events  and  make  adjustments  (e.g.,  retasking  an  airborne  aircraft  or  aborting  a  mission). 

In  order  to  accomplish  the  goal  of  agile  control,  two  essential  tasks  need  to  be  carried  out.  First,  we  need  to 
develop  an  appropriate  modeling  framework  for  the  JAO  environment.  Since  this  environment  is  extremely  complex, 
one  is  tempted  to  develop  a  highly  detailed  model,  normally  in  a  simulation  setting.  While  such  a  model  may  contain 
great  descriptive  value,  it  is  often  of  little  use  for  prescriptive  purposes,  i.e.,  as  the  basis  for  deriving  the  agile  control 
schemes we seek. This is because highly detailed models are not only computationally intensive so that real-time
applications  are  out  of  the  question,  but  they  often  obscure  the  salient  features  which  are  needed  to  determine  the 
right  action  for  a  given  situation  (the  analogy  of  “missing  the  forest  for  the  trees”  applies  here).  Therefore,  the 
challenge  is  to  identify  the  appropriate  level  of  modeling  which  will  yield  “just  enough”  useful  information  to  make 
optimal  decisions.  The  second  task  is  that  of  formulating  and  solving  optimization  problems,  based  on  an  appropriate 
battle  space  model,  which  capture  the  fundamental  objective  of  JAO:  maximize  the  reward  associated  with  targets 
while  minimizing  the  loss  of  friendly  assets.  While  the  formulation  of  such  problems  can  be  quite  straightforward 
when  an  appropriate  model  is  available,  their  solution  is  far  from  feasible  due  to  a  variety  of  complexities  which  we 
will  discuss.  1 

In this paper, we present a stochastic dynamic model for the battle space, which is designed to suit the JAO level
of detail. This is accomplished by adopting a Discrete Event System (DES) framework and identifying events and
state transitions which capture the key features of the processes describing battle space entity interactions (Section
2). This framework forms the basis of a simulation tool we have developed, which is briefly described in Section
3. In Section 4, we formulate a stochastic dynamic optimization problem the solution of which provides the agile
control mechanism desired. The complexity of such a problem, however, is such that obtaining an exact solution is
infeasible. One must therefore seek approximate solution methodologies. Some such methodologies are presented in
Wohletz et al., where the advantages of closed-loop (dynamic) control schemes over open-loop (static) schemes
are illustrated. In this paper, we include a more recent approach that combines a rollout strategy, as described in
Wohletz et al., with the "surrogate problem" method presented in Gokbayrak and Cassandras.

2.  MODELING  FRAMEWORK 

We  adopt  a  view  of  the  battle  space  as  consisting  of  several  interacting  entities: 

1.  Friendly  assets  which  originate  at  bases.  These  assets  are  further  classified  into  different  types,  such  as  Strike 
Aircraft  (SA),  Wild  Weasels  (WW),  and  Jammers.  SA  carry  munition  to  be  delivered  to  targets.  WW  provide 
support  to  the  SA  to  protect  them  from  enemy  threats,  while  Jammers  provide  electronic  air  defense  suppression 
capabilities. 

2.  Threats  are  typically  air  defense  resources,  such  as  SAMs,  which  may  engage  aircraft  on  their  way  to  various 
targets  or  may  be  positioned  so  as  to  directly  protect  a  target. 

3.  Targets  are  the  ultimate  destination  of  SA  and  they  may  contain  their  own  air  defense  resources  that  can  engage 
friendly  assets. 

A mission is the process of creating a "package" of friendly assets located at a base and assigning it to a target.
The package is subsequently also assigned a route from base to target. For the purpose of JAO, we do not care to
model the flight of each member of the package in detail. Instead, we represent the route from base to target as a set
of "way points" $\{W_0, W_1, \ldots, W_n\}$, where $W_0$ denotes the base and $W_n$ denotes the target. A way point is selected
to  represent  a  predefined  point  along  the  route  or  a  known  threat.  In  addition,  there  may  be  unknown  threats  in 
between  way  points,  in  which  case  the  model  must  automatically  create  a  new  way  point  for  the  mission.  The  travel 
time  between  way  points  is  a  random  variable  based  on  aircraft  speed  and  location  of  way  points.  Thus,  a  typical 
mission  consists  of  traveling  from  one  way  point  to  the  next  and  possibly  engaging  threats  that  may  either  be  known 
or  unknown.  When  (and  if)  the  package  arrives  at  its  target,  it  engages  it  and  subsequently  must  return  to  base  by 
reversing  its  route  (which  may  be  modified  depending  on  new  information). 

The  state  of  a  mission  is  defined  by  the  number  of  each  aircraft  type  in  the  package  that  are  still  alive  and 
the  location  of  the  package.  In  addition,  the  state  may  include  remaining  munitions  for  each  aircraft  and  other 
information  that  we  shall  not  take  explicitly  into  account  at  this  level  of  modeling.  The  state  of  a  target  is  binary 
representing  whether  it  is  “alive”  or  “destroyed”.  The  state  of  a  threat  is  similarly  defined.  Thus,  the  overall  state 
of  the  battle  space  is  described  by  the  states  of  all  missions,  threats,  and  targets. 

In  order  to  systematically  capture  the  dynamics  of  the  battle  space  as  its  entities  interact,  as  well  as  the  effect 
of uncertain factors, we view it at the level of a DES and adopt a Stochastic Timed Automaton model. A simple
automaton is defined by $(X, E, \Gamma, f, x_0)$, where (i) $X$ is a countable set of states, (ii) $E$ is a countable set of events,
(iii) $\Gamma(x)$ is the set of feasible events when the state is $x$; this is a subset of $E$ containing all events which are allowed
to occur at state $x$, (iv) $f(x, e)$ is a state transition function such that when event $e$ occurs at state $x$ the next state
is $x' = f(x, e)$; this can easily be replaced by a probabilistic mechanism such that the next state $x'$ is determined
with probability $p(x'; x, e)$, (v) $x_0$ is a given initial state. To obtain a timed automaton, a clock mechanism is added
so that the stochastic DES can determine the next event to occur at state $x$, since $\Gamma(x)$ generally contains more than
one feasible event. Thus, whenever an event occurs and the state is $x$, every event $i \in \Gamma(x)$ is associated with a
clock value $y_i$ and the next event to occur, denoted by $e'$, is the one with the smallest clock value; formally:

$$e' = \arg\min_{i \in \Gamma(x)} \{y_i\}$$

When such an event occurs, the next state is simply given by

$$x' = f(x, e') \qquad (1)$$


Figure 1. Mission related events


Moreover, if state $x$ was entered at time $t$, then the new system time after event $e'$ occurs is given by

$$t' = t + \min_{i \in \Gamma(x)} \{y_i\}$$

Observe that time is only updated when an event occurs; all battle space activity in between events is irrelevant to
this model, a fact that maintains simplicity and computational efficiency. At time $t'$ all events that remain feasible
in the new state have their clocks decremented by setting

$$y_i' = y_i - \min_{j \in \Gamma(x)} \{y_j\}$$

since $\min_{j \in \Gamma(x)} \{y_j\}$ units have elapsed since the last event. An event such that $i \in \Gamma(x)$ and $i \notin \Gamma(x')$ is
eliminated and its clock discarded. Finally, if $e' \in \Gamma(x')$, this event is assigned a new clock value $V_i$ (also referred
to as the event's lifetime). In a stochastic timed automaton, this value is a random variable characterized by some
distribution $G_i(\cdot)$. In a simulation setting, $V_i$ is a sample from $G_i(\cdot)$ obtained through a pseudo-random number
generator. Further details are omitted but may be found in Cassandras and Lafortune.
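
The event-selection and clock-update rules above can be sketched as a single transition function. This is a minimal illustration rather than the simulation tool itself: the function and argument names are hypothetical, and assigning a fresh lifetime to events that become newly feasible in the next state is the standard convention, assumed here because the text omits that detail.

```python
def advance(state, time, clocks, gamma, f, sample_lifetime):
    """Perform one transition of a (stochastic) timed automaton.

    clocks: dict mapping each feasible event to its residual clock value y_i.
    gamma(x): set of feasible events in state x.
    f(x, e): next state when event e occurs in state x.
    sample_lifetime(e): draws a new lifetime for event e from its distribution.
    Returns (next_state, next_time, next_clocks, triggering_event).
    """
    event = min(clocks, key=clocks.get)   # e' = arg min of the clock values
    elapsed = clocks[event]
    next_state = f(state, event)          # x' = f(x, e')
    next_time = time + elapsed            # t' = t + smallest clock value
    next_clocks = {}
    for e in gamma(next_state):
        if e in clocks and e != event:
            # Still feasible: decrement the residual clock.
            next_clocks[e] = clocks[e] - elapsed
        else:
            # Triggering event that stays feasible, or newly feasible event.
            next_clocks[e] = sample_lifetime(e)
    # Events feasible in x but not in x' are dropped with their clocks.
    return next_state, next_time, next_clocks, event
```

Passing `sample_lifetime` as a callable keeps the sketch independent of any particular lifetime distribution; a deterministic lambda is convenient for checking the bookkeeping by hand.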

The event set $E$ required to model the battle space includes all events that may cause a state transition in any of
the entities we have defined. The key events that describe a typical mission state evolution are shown in Figure 1 in
the sequence in which they may occur. The mission is initiated at a base with a LOAD event, representing the process
of loading munitions and configuring all assets to be included in the mission package. When this is completed, a
DEPART_BASE event becomes feasible and may be scheduled. After this occurs, the package is physically airborne,
on its way to the next way point in its route. Omitting way points that involve no threat encounters, the next
event shown is either DETECT_THREAT or ARRIVE_TARGET. In the former case, a threat engagement process
is initiated, whose details we omit. After END_ENGAGE occurs, the states of both the threat and the package will
generally change. If no engagement actually takes place (e.g., because of the effect of jamming), then END_ENGAGE
simply coincides with DETECT_THREAT. Note that a new DETECT_THREAT may occur, since subsequent way
points can include additional threat encounters. It is also possible the package is destroyed or decides to abort the
mission and return to base, as shown in Figure 1. In the case where the package reaches the target, a similar process
takes place.

In  addition,  it  is  possible  that  the  package  makes  a  rerouting  decision,  which  may  consist  of  aborting  the  mission, 
selecting a new target, or specifying a new way point that includes a threat encounter. It is at this point that control
decisions are critical and that feedback and the solution of an optimization problem enter the process. Moreover,
note  that  these  decisions  may  well  depend  on  information  supplied  by  other  packages,  which  creates  a  “cooperative 
control”  setting.  In  some  cases,  for  example,  it  is  possible  to  combine  two  or  more  packages  and  define  a  new  mission 
or  split  a  package  up  so  that  other  missions  are  provided  with  additional  assets.  Although  details  are  omitted,  all 
these  possibilities  are  part  of  this  DES  setting. 


Figure 2. Typical screen snapshot of simulation tool

3.  SIMULATION  MODEL 

Using  the  Stochastic  Timed  Automaton  structure  described  in  the  previous  section,  a  discrete-event  simulation  model 
was  developed  that  includes  a  graphical  user  interface  for  defining  the  battle  space  and  for  monitoring  all  mission 
activity.  A  typical  screen  shot  is  shown  in  Figure  2,  with  a  magnified  view  provided  through  a  zooming  in  and 
out  capability  shown  in  Figure  3.  The  simulation  environment  allows  one  to  define  all  battle  space  entities  and 
their  attributes  (e.g.,  target  location  and  value,  threat  location  and  range).  When  a  simulation  is  executed,  a  state 
trajectory  unfolds  which  may  be  graphically  observed,  while  a  detailed  event  trace  is  also  provided  (see  bottom  of 
screen  in  Figure  2). 

The  main  value  of  such  a  simulation  tool  is  to  allow  us  to  test  different  mission  planning  and  control  strategies. 
In  addition,  the  underlying  simulation  engine  may  also  be  used  as  an  integral  part  of  some  controllers  which  involve 
estimating  alternative  performance-related  quantities. 

4.  OPTIMAL  MISSION  PLANNING  AND  AGILE  CONTROL 

The dynamics of the battle space, viewed as a DES, are described by the state transition mechanism of the stochastic
timed automaton previously described. Thus, letting $k = 1, 2, \ldots$ index all events that take place over a given time
interval, we can rewrite the state transition function (1) as

$$x_{k+1} = f(x_k, u_k, w_k) \qquad (2)$$

where $x_k \in \mathcal{X}$ is the battle space state, $u_k$ is a control decision which can be made following the $k$th event and
which is selected from a finite set $U_k(x_k)$, and $w_k$ represents random factors that affect the state. In general, the
control decision $u_k$ is a function of the observed state $x_k$, so we can write $u_k = \mu_k(x_k) \in U_k(x_k)$. A control policy $\pi_k$
applied for a time horizon that includes $N$ events into the future is a sequence of functions that map each state $x_i$
to a control $u_i$ for all events $i = k, \ldots, k+N-1$:

$$\pi_k = \{ \mu_k, \mu_{k+1}, \ldots, \mu_{k+N-1} \} \qquad (3)$$

In order to make optimal choices whenever a decision is made, cost functions $g_k(x_k, \mu_k(x_k), w_k)$ for all $k = 1, 2, \ldots$ are
defined that quantify the immediate effect of selecting $u_k = \mu_k(x_k)$. In addition, a terminal cost function $G_{k+N}(x_{k+N})$
is defined. Then, the total expected cost incurred by a policy initiated at $k$ (also known as the expected cost-to-go)
is

$$J_k(x_k) = E\left[ G_{k+N}(x_{k+N}) + \sum_{i=k}^{k+N-1} g_i(x_i, \mu_i(x_i), w_i) \right] \qquad (4)$$

Figure 3. Magnified screen snapshot of simulation tool


which can also be rewritten recursively as

$$J_k(x_k) = E\left\{ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}(f(x_k, \mu_k(x_k), w_k)) \right\}, \quad k = 1, 2, \ldots \qquad (5)$$

with the initial condition

$$J_{k+N}(x_{k+N}) = G_{k+N}(x_{k+N}) \qquad (6)$$


This is the basis of the well-known Stochastic Dynamic Programming (SDP) algorithm [2], which provides a solution
to the problem of determining $\pi_k = \{ \mu_k, \ldots, \mu_{k+N-1} \}$ minimizing the total expected cost; in particular, the optimal
policy is obtained from

$$u_k^* = \arg\min_{u_k \in U_k(x_k)} E\left[ g_k(x_k, u_k, w_k) + J_{k+1}^*(f(x_k, u_k, w_k)) \right] \qquad (7)$$

where $J_{k+1}^*(\cdot)$ is the optimal cost-to-go starting at $k+1$ with $J_{k+N}^*(x_{k+N}) = G_{k+N}(x_{k+N})$. Although in
principle one can apply this algorithm to obtain an explicit solution to the problem, this approach is computationally
intractable for all but very simple problems. When it comes to using it in the JAO context, there are four broad
areas of complexity we are facing: combinatorial, stochastic, distributed, and computational. First, the problem is
combinatorially  complex  since  the  number  of  states  and  control  options  grows  exponentially  with  the  number  of  assets 
and  targets.  Second,  the  time  scale  of  interest  in  mission  control  is  long,  which  results  in  a  high  degree  of  future 
uncertainty  contributed  by  (among  many  factors)  sensor  inaccuracies,  unexpected  hostile  actions,  and  deviations 
from  a  plan  upon  its  execution.  Another  difficulty  arises  from  distributed  complexity,  which  includes  collection 
and  management  of  information  from  different  assets,  together  with  coordination  and  communication  strategies  to 
achieve  cooperative  control.  Finally,  computational  complexity  is  due  to  the  need  for  real-time  decisions  which  must 
normally  be  implementable  onboard  an  aircraft. 
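To make the backward recursion in (5)-(7) concrete, the following minimal Python sketch applies it to a deliberately tiny problem. Everything in the toy model is an assumption made for illustration: the state is the number of surviving targets, the control is "hold" (0) or "strike" (1), a strike costs one unit and succeeds with probability 0.5, and the terminal cost is 4 per surviving target. Stages are indexed 0 to N rather than k to k+N. The nested loops over states, controls, and outcomes also show why the approach does not scale to realistic JAO state spaces.

```python
def sdp(states, controls, outcomes, f, g, G, N):
    """Finite-horizon stochastic dynamic programming by backward recursion.

    outcomes is a list of (w, probability) pairs.
    Returns (J, mu) where J[k][x] is the optimal cost-to-go and
    mu[k][x] the minimizing control at stage k in state x.
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = G(x)  # terminal condition, as in (6)
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best_cost = None, float("inf")
            for u in controls(x):
                # Expected one-step cost plus cost-to-go, as in (7).
                cost = sum(p * (g(x, u, w) + J[k + 1][f(x, u, w)])
                           for w, p in outcomes)
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x] = best_cost
            mu[k][x] = best_u
    return J, mu


# Hypothetical toy model (all numbers are illustrative assumptions).
def toy_f(x, u, w):
    # A strike (u = 1) removes one target when the outcome w is a success.
    return x - 1 if (u == 1 and w == 1 and x > 0) else x


def toy_g(x, u, w):
    return float(u)  # each strike costs one unit


def toy_G(x):
    return 4.0 * x  # penalty per surviving target


def toy_controls(x):
    return [0, 1] if x > 0 else [0]


J, mu = sdp(states=[0, 1, 2], controls=toy_controls,
            outcomes=[(1, 0.5), (0, 0.5)],
            f=toy_f, g=toy_g, G=toy_G, N=2)
print(J[0], mu[0])
```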


One  of  the  most  common  ways  used  to  overcome  these  complexities  is  based  on  “decomposition”  of  the  overall 
problem  into  smaller,  more  manageable  components.  If  such  decomposition  is  time-based,  we  can  first  solve  an 
optimization problem aimed at determining $u_k = \mu_k(x_k)$ that minimizes


instead of (4) over a limited time horizon reflected by the choice of $N$. Since there are often events following which
control  actions  may  not  be  taken,  it  is  reasonable  to  decompose  a  campaign  in  this  fashion,  choosing  appropriate 
time  intervals  based  on  “significant  events”.  For  example,  when  packages  return  to  base  it  is  reasonable  to  collect 
all  returning  assets  and  plan  new  missions;  on  the  other  hand,  an  event  such  as  detecting  a  threat  may  not  warrant 
re-evaluation  of  all  possible  control  actions.  Although  (8)  is  a  static  optimization  problem  requiring  the  specification 
of a single control $u_k$, rather than a sequence $\{ u_k, \ldots, u_{k+N-1} \}$, it is still a very challenging task. The drawback

of  this  approach  is  that  it  prevents  the  controller  from  making  decisions  that  take  into  account  future  events  beyond 
the  selected  time  horizon.  As  an  example,  if  a  time-critical  high-value  target  is  identified  for  some  future  time  outside 
the  present  decision  horizon,  the  controller  is  not  able  to  reserve  adequate  resources  to  define  a  mission  for  this  target. 
To deal with this issue, there are two options. First, one can extend the time horizon and increase the dimensionality
of the control vector to account for future decisions, thus trading off performance for computational efficiency.
Alternatively, one can approximate the expected cost-to-go by replacing $\mu_i(x_i)$ in (4) for $i > k$ with a baseline
heuristic $\bar{\mu}_i(x_i)$. At each subsequent event, the problem can be re-solved, “rolling out” the time horizon forward, hence
the term “rollout algorithm” [3], which was used in Wohletz et al. [14]. In this case, the approximate optimal control
obtained at $k$ is

$$\tilde{u}_k = \arg\min_{u_k \in U_k(x_k)} E\left[ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}(f(x_k, u_k, w_k)) \right] \qquad (9)$$


Thus, the rollout algorithm seeks the best control at state $x_k$ that includes the current one-step cost and an approximate
cost-to-go obtained through a baseline heuristic to model future decisions. However, evaluating the expected
value $E[\tilde{J}_{k+1}(f(x_k, u_k, w_k))]$ is still a difficult problem. More generally, the problem is to estimate the so-called
Q-function

$$Q(x_k, u_k) = E\left[ g_k(x_k, u_k, w_k) + \tilde{J}_{k+1}(f(x_k, u_k, w_k)) \right] \qquad (10)$$

for any control $u_k \in U_k(x_k)$. One approach is to estimate this expectation through simulation. Thus, using the
simulation model discussed earlier we can generate sample paths starting at state $x_k$ and satisfying

$$x_{k+1} = f(x_k, u_k, w_k), \qquad x_{i+1} = f(x_i, \bar{\mu}_i(x_i), w_i), \quad i = k+1, \ldots, k+N-1 \qquad (11)$$

where $\bar{\mu}_i$ denotes the baseline heuristic. The resulting estimate is denoted by $\hat{Q}(x_k, u_k)$. Thus, this approach reduces to solving an optimization problem to
determine

$$\hat{u}_k = \arg\min_{u_k \in U_k(x_k)} \hat{Q}(x_k, u_k) \qquad (12)$$

This problem is far from simple, as the size of the set $U_k(x_k)$ is typically enormous (in the billions of possible
control choices). In the remainder of this paper, we will discuss an approach for its solution, based on the premise
that the cost function $Q(x_k, u_k)$ can be adequately approximated.
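The simulation-based estimate of the Q-function and the resulting rollout decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the controller used in this work: the function names, the toy dynamics, and the "always hold" baseline heuristic are hypothetical, and the candidate controls are enumerated exhaustively, which is only possible because the toy control set has two elements.

```python
import random


def estimate_q(x, u, f, g, G, base_policy, sample_w, horizon, n_paths, rng):
    """Monte Carlo estimate of Q(x, u): apply u now, the baseline afterwards."""
    total = 0.0
    for _ in range(n_paths):
        w = sample_w(rng)
        cost = g(x, u, w)            # one-step cost under the candidate control
        xi = f(x, u, w)
        for _ in range(horizon - 1):
            ui = base_policy(xi)     # baseline heuristic models future decisions
            w = sample_w(rng)
            cost += g(xi, ui, w)
            xi = f(xi, ui, w)
        total += cost + G(xi)        # add terminal cost at the end of the horizon
    return total / n_paths


def rollout_control(x, controls, f, g, G, base_policy, sample_w,
                    horizon, n_paths, rng):
    """Pick the control with the smallest estimated Q-value."""
    def q(u):
        return estimate_q(x, u, f, g, G, base_policy, sample_w,
                          horizon, n_paths, rng)
    return min(controls(x), key=q)


# Hypothetical toy model: state = surviving targets, u = 1 means "strike".
def toy_f(x, u, w):
    return x - 1 if (u == 1 and w == 1 and x > 0) else x


def toy_g(x, u, w):
    return float(u)


def toy_G(x):
    return 4.0 * x


def toy_controls(x):
    return [0, 1] if x > 0 else [0]


if __name__ == "__main__":
    rng = random.Random(0)
    coin = lambda r: 1 if r.random() < 0.5 else 0  # strike succeeds w.p. 0.5
    print(rollout_control(2, toy_controls, toy_f, toy_g, toy_G,
                          lambda x: 0, coin, horizon=2, n_paths=200, rng=rng))
```

With billions of candidate controls, the enumeration inside `rollout_control` is exactly the step that becomes infeasible, which motivates the approach discussed in the remainder of the paper.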


4.1.  Optimal  Mission  Planning 

We begin by using the model discussed in Section 2 for a JAO setting in order to identify the precise structure of the
control choice set $U_k(x_k)$ when the state is $x_k$. Let us assume that this state reflects an initial condition whereby
missions are being assigned to various targets at some base. Let $M$ be the number of targets this base is responsible
for and let $Q$ be the number of different asset types that may be used to design strike packages (for our purposes,
we set $Q = 3$ representing the three types of aircraft configurations mentioned earlier, i.e., SA, WW, and Jammers).
The control decision can be expressed as a vector of dimensionality $M \cdot Q$ of the form

$$u = [u_{1,1}, \ldots, u_{1,Q}, \ldots, u_{M,1}, \ldots, u_{M,Q}] \qquad (13)$$

where $u_{i,q}$ is the number of assets of type $q$ allocated to target $i$. The set of possible decisions at this state, denoted
by $U$, is limited by the capacity constraints

$$\sum_{i=1}^{M} u_{i,q} \le K_q, \quad q = 1, \ldots, Q \qquad (14)$$

where $K_q$ is the total number of assets of type $q$ available at the base. There may also be additional constraints such
as $\beta_{i,q} \le u_{i,q} \le \gamma_{i,q}$ for $i = 1, \ldots, M$, if a package assigned to target $i$ is not allowed to exceed $\gamma_{i,q}$ assets of type $q$
and is required to include at least $\beta_{i,q}$ assets. For example, a constraint may be imposed that no mission can use
more than a given number of WW or that it must include at least one SA.
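As a small illustration of the decision set, the sketch below checks a candidate allocation against the capacity constraints and optional per-target bounds, and counts how many nonnegative integer allocations satisfy the capacity constraints alone. The function names and the nested-list layout `u[i][q]` are assumptions made for this example. The count uses the standard stars-and-bars identity: for one asset type with capacity K spread over M targets, there are C(M + K, K) allocations with total at most K.

```python
from math import comb


def is_feasible(u, K, lower=None, upper=None):
    """Check an allocation u[i][q] against capacities K[q] and optional bounds."""
    M, Q = len(u), len(K)
    for q in range(Q):
        # Capacity constraint: total assets of type q cannot exceed K[q].
        if sum(u[i][q] for i in range(M)) > K[q]:
            return False
    for i in range(M):
        for q in range(Q):
            lo = 0 if lower is None else lower[i][q]
            if u[i][q] < lo:
                return False
            if upper is not None and u[i][q] > upper[i][q]:
                return False
    return True


def count_allocations(M, K):
    """Number of nonnegative integer allocations satisfying capacities only."""
    total = 1
    for Kq in K:
        total *= comb(M + Kq, Kq)  # allocations of one type with sum <= Kq
    return total


if __name__ == "__main__":
    print(is_feasible([[1, 0], [2, 1]], K=[3, 1]))
    # A single asset type with 20 assets and 16 targets already yields
    # more than a billion allocations.
    print(count_allocations(16, [20]))
```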

As already mentioned, the optimization problem we are interested in formulating is intended to maximize the
reward obtained from successful destruction of targets while minimizing the cost of asset loss. In order to formulate
such a problem, we associate a value $V_i$ with the $i$th target and a cost $C_q$ with an asset of type $q$. In addition, let
$P_i^T(u)$ denote the probability of successfully destroying target $i$ under a control vector $u$, and $P_{i,q}^A(u)$ the probability
that an asset of type $q$ is lost during the execution of the $i$th mission under $u$. Then, ignoring any future missions,
the optimization problem is to determine $u$ in (13) so as to maximize the total expected reward of the mission

$$J(u) = \sum_{i=1}^{M} \left[ V_i P_i^T(u) - \sum_{q=1}^{Q} C_q u_{i,q} P_{i,q}^A(u) \right] \qquad (15)$$

subject to the constraint (14) and possibly more mission-dependent constraints. Note that for some choices of $u$ it is
possible that $J(u) < 0$. Clearly, the solution of this problem requires knowledge of the probabilities $P_i^T(u)$ and $P_{i,q}^A(u)$
for $i = 1, \ldots, M$, $q = 1, \ldots, Q$ and for all possible $u \in U$. These probabilities depend on factors such as the outcome of
engagements with threats and targets and the possible cooperation across missions that may fly over common threats,
as well as basic parameters such as the firing rate of aircraft and the effectiveness of various weapons. It is possible to
evaluate $P_i^T(u)$ and $P_{i,q}^A(u)$ analytically using the stochastic timed automaton model of Section 2 enhanced by more
detailed engagement models and by making some simplifying assumptions. Alternatively, it is possible to estimate
them through simulation, although this becomes a prohibitively time-consuming task. Yet another approach is not
to attempt to solve an explicit optimization problem, but rather to rely on heuristics such as the “greedy heuristic”
described in Wohletz et al. [14]
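A single-wave reward of this kind can be evaluated with a few lines of code once the two probability models are supplied. The sketch below assumes the reward is target value times destruction probability, minus asset cost times the number of assets times the per-asset loss probability; this form, the function names, and the toy probability models (independent kills with probability 0.5 per aircraft, constant loss probability 0.1) are all illustrative assumptions rather than the engagement models of this work.

```python
def expected_reward(u, V, C, p_kill, p_loss):
    """Expected single-wave reward for allocation u[i][q].

    p_kill(i, u)    -> probability that target i is destroyed under u
    p_loss(i, q, u) -> probability that one asset of type q on mission i is lost
    """
    total = 0.0
    for i in range(len(V)):
        total += V[i] * p_kill(i, u)
        for q in range(len(C)):
            # Expected loss cost: cost x number of assets x loss probability.
            total -= C[q] * u[i][q] * p_loss(i, q, u)
    return total


# Hypothetical probability models (assumptions for illustration only).
def toy_p_kill(i, u):
    # Each aircraft of type 0 destroys the target independently w.p. 0.5.
    return 1.0 - 0.5 ** u[i][0]


def toy_p_loss(i, q, u):
    return 0.1


if __name__ == "__main__":
    V, C = [10.0, 0.1], [2.0]
    print(expected_reward([[2], [0]], V, C, toy_p_kill, toy_p_loss))
    # Sending three aircraft against a low-value target gives a negative reward.
    print(expected_reward([[0], [3]], V, C, toy_p_kill, toy_p_loss))
```

The second call illustrates the remark that the reward can be negative for some allocations: the expected asset loss outweighs the value of the target.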

Note  that  the  solution  of  (15)  does  not  include  any  future  decisions  beyond  a  single  wave  of  missions.  Looking 
at (10), this problem considers only the first term and not the expected cost-to-go term $E[\tilde{J}_{k+1}(f(x_k, u_k, w_k))]$.
However,  using  a  rollout  approach  as  described  earlier,  this  can  be  combined  with  (15)  so  that  the  problem  we  face 
is  (12)  with  the  control  decision  being  of  the  form  (13). 

In what follows, we will present a methodology based on the “surrogate problem” idea [8] and provide some
numerical  examples  comparing  it  to  some  alternatives. 


4.2.  The  “Surrogate  Problem”  Method 

A  crucial  difficulty  with  the  problem  of  maximizing  J{u)  in  (15)  is  that  the  control  vector  u  in  (13)  is  discrete.  This 
prevents  us  from  using  optimization  techniques  from  conventional  nonlinear  programming  with  continuous  decision 
variables,  which  are  typically  based  on  gradient  information.  The  alternative  is  to  rely  on  methods  for  discrete 
optimization.  However,  even  in  a  deterministic  setting  this  class  of  problems  is  NP-hard  and  one  must  rely  on  some 
form of a search algorithm (e.g., Simulated Annealing [1], Genetic Algorithms [11]). In a stochastic environment such
as in JAO, the problem is further complicated by the need to estimate the objective function of interest, such as
$\hat{Q}(x_k, u_k)$ in (12). This generally requires Monte Carlo simulation or direct measurements made on the actual system.
Most known approaches are based on some form of random search, as in algorithms proposed by Yan and Mukai [15],
Gong et al. [9], and Shi and Olafsson [13]. Another recent contribution to this area involves the ordinal optimization approach
presented in Ho et al. [10] and used by Cassandras et al. [4] to solve a class of resource allocation problems. The main
difficulty  with  all  these  methods  is  that  they  are  more  suited  to  be  off-line  approaches  lacking  the  real-time  speed 
required  in  JAO. 

The key idea in the “surrogate problem” method introduced by Gokbayrak and Cassandras [8] and generalized by
the same authors [7] is to transform the discrete optimization problem (12) into a “surrogate” continuous optimization
problem which is solved using standard gradient-based methods; its solution is then transformed back into a solution
of the original problem. Thus, suppose that we begin by relaxing the integer constraint on all $u_{i,q}$ in (15) so that
they can be regarded as continuous (real-valued) variables. The resulting “surrogate” problem then becomes: Find
$\rho^* \in U_c$ that minimizes the cost function $J_c(\rho)$ over the continuous set $U_c$, i.e.,

$$J_c(\rho^*) = \min_{\rho \in U_c} J_c(\rho) = \min_{\rho \in U_c} E[L_c(\rho, \omega)] \qquad (16)$$

where $\rho = [\rho_{1,1}, \ldots, \rho_{M,Q}]$, $\rho_{i,q} \in \mathbb{R}_+$, is a real-valued vector, $U_c$ is a constraint set such that
$U \subset U_c$, and $L_c(\rho, \omega)$ is the cost function over a specific sample path (denoted by $\omega$) when the state is $\rho$. Omitting
details which may be found in Gokbayrak and Cassandras [8], we outline below the basic “surrogate problem” scheme.
Initially, we set the surrogate control vector to be that of the actual one, i.e., $\rho_0 = u_0$. To avoid dealing with integer
values in the surrogate problem, let us perturb the components of $u_0$ by arbitrarily small amounts $\delta_0$ as long as
the constraints are not violated, so that

$$\rho_0 = u_0 + \delta_0$$

Subsequently, at the $n$th step of the process, let $H_n(u_n, \omega_n)$ denote an estimate of the sensitivity of the cost $J_c(\rho_n)$
with respect to $\rho_n$ obtained over a sample path $\omega_n$ of the actual system operating under control $u_n$. Two sequential
operations are then performed at the $n$th step:

1. The continuous state $\rho_n$ is updated through

$$\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n H_n(u_n, \omega_n)] \qquad (17)$$

where $\pi_{n+1}$ is a projection function onto the set $U_c$ so that $\rho_{n+1} \in U_c$, depending on the nature of the set $U_c$,
and $\eta_n$ is a “step size” parameter.

2. The newly determined control vector of the surrogate problem, $\rho_{n+1}$, is transformed into an actual feasible
discrete vector of the original system through

$$u_{n+1} = f_{n+1}(\rho_{n+1}) \qquad (18)$$

where $f_{n+1} : U_c \to U$ is a mapping of feasible continuous controls to feasible discrete controls which must be
appropriately selected.
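The two-step iteration above can be sketched for a single asset type as follows. This is a loose illustration of the structure of the scheme only, and every component is a stand-in chosen for simplicity: the "projection" clips and rescales rather than computing a true projection, the continuous-to-discrete mapping is a componentwise floor, the sensitivity estimate is a forward difference of a deterministic toy cost over neighboring discrete allocations, and the step size decreases as 1/(n+1). The actual mapping and estimators are those developed in the cited surrogate problem papers, not the ones shown here.

```python
import math


def project(rho, K):
    """Crude stand-in for the projection: clip negatives, rescale to capacity."""
    rho = [max(0.0, r) for r in rho]
    s = sum(rho)
    if s > K:
        rho = [r * K / s for r in rho]
    return rho


def to_discrete(rho):
    """Stand-in mapping to a feasible integer vector (floor keeps sum <= K)."""
    return [int(math.floor(r)) for r in rho]


def surrogate_minimize(cost, u0, K, steps=30, eta=0.5, delta=0.25):
    rho = project([x + delta for x in u0], K)   # perturbed initial point
    u = to_discrete(rho)
    best_u, best_cost = u, cost(u)
    for n in range(steps):
        base = cost(u)
        # Sensitivity estimate from the discrete system (forward differences).
        H = []
        for i in range(len(u)):
            neighbor = list(u)
            neighbor[i] += 1
            H.append(cost(neighbor) - base)
        step = eta / (n + 1)
        # Step 1: gradient-type update of the continuous state.
        rho = project([r - step * h for r, h in zip(rho, H)], K)
        # Step 2: map back to a feasible discrete control.
        u = to_discrete(rho)
        c = cost(u)
        if c < best_cost:
            best_u, best_cost = u, c
    return best_u, best_cost


# Hypothetical toy cost: negative of a concave reward minus loss cost.
TOY_V = [10.0, 4.0]


def toy_cost(u):
    reward = sum(v * (1.0 - 0.5 ** n) for v, n in zip(TOY_V, u))
    return -(reward - 0.2 * sum(u))


if __name__ == "__main__":
    print(surrogate_minimize(toy_cost, u0=[0, 0], K=3))
```

The point of the sketch is the division of labor: costs and sensitivities are always evaluated at a feasible discrete allocation, while the update itself is carried out on the continuous surrogate state.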

One can recognize in (17) the form of a stochastic approximation algorithm [12] that generates a sequence $\{\rho_n\}$
aimed at solving (16). However, there is an additional operation (18) for generating a sequence $\{u_n\}$ which we would
like to see converge to $u^*$, the solution of (15). It is important to note that $\{u_n\}$ corresponds to feasible realizable
controls based on which one can evaluate estimates $H_n(u_n, \omega_n)$ from observable data, i.e., a sample path of the
actual system under $u_n$ (not the surrogate control $\rho_n$). We can therefore see that this scheme is intended to combine
the advantages of a stochastic approximation type of algorithm with the ability to obtain sensitivity estimates with
respect to discrete decision variables. In particular, sensitivity estimation methods for discrete parameters based on
Concurrent Simulation [6] are ideally suited to meet this objective.

The cornerstones of this method are the selection of the mapping $f_{n+1}$ in (18) and of a surrogate cost function
$L_c(\rho, \omega)$ whose relationship to the actual cost must be made explicit. In addition, the estimates $H_n(u_n, \omega_n)$ necessary
for the optimization scheme described above must be obtained. These are discussed in detail in Gokbayrak and
Cassandras [7].

4.3.  A  Mission  Planning  Scenario  Example 

We  illustrate  our  approach  to  optimal  mission  planning  using  a  sample  scenario  shown  in  Figure  4.  In  this  scenario 
there  are  M  =  16  targets  with  different  values  protected  by  16  SAM  sites  as  shown.  Our  task  is  to  plan  missions  at 
a  base  with  two  asset  types  (i.e.,  Q  =  2),  SA  and  WW,  such  that  we  have  at  our  disposal  Ki  =  20  SA  and  K2  —  S 
WW. We then seek a 32-dimensional vector $u = [u_{1,1}, u_{1,2}, \ldots, u_{16,1}, u_{16,2}]$ to maximize a reward function of the
form (15) or, if we approximate the cost-to-go $E[\tilde{J}_{k+1}(f(x_k, u_k, w_k))]$ in (10), we can minimize $\hat{Q}(x_k, u_k)$ in (12).

As  shown  in  Figure  4,  packages  are  assumed  to  fly  in  a  straight  line  between  base  and  assigned  target  and  must, 
therefore,  engage  threats  on  their  way  to  a  target  or  during  the  returning  part  of  the  mission.  The  number  of  possible 
solutions  to  this  problem  is  about  10^®,  and  that  is  only  reflective  of  the  combinatorial  complexity  aspect  of  it. 

An  added  complication  in  this  problem  is  the  fact  that  there  are  multiple  local  optima.  The  “surrogate  problem” 
method  provides  an  attractive  means  of  dealing  with  this  difficulty  because  of  its  convergence  speed.  Our  approach 
in  this  case  is  to  randomize  over  the  initial  controls  uq  (equivalently,  po)  and  seek  a  (possibly  local)  minimum 
corresponding  to  this  initial  point.  The  process  is  repeated  for  different,  randomly  selected,  initial  controls  so  as  to 
seek better solutions. For deterministic problems, the best allocation seen so far is reported as the optimal. For
stochastic problems, we adopt the stochastic comparison approach in Gong et al. [9]. The algorithm is run from a
randomly selected initial point and the cost of the corresponding final point is compared with the cost of the “best
point seen so far”. The stochastic comparison test [9] is applied to determine the “best point seen so far” for the next
run.

Figure 4. A mission planning scenario example

Figure 5. Performance comparison over different solution approaches (single mission wave)
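For the deterministic case, the randomized multi-start procedure described above reduces to keeping the best local solution found over several runs. The sketch below illustrates this with a hypothetical one-dimensional cost that has several local minima and a simple neighbor-descent routine standing in for the surrogate problem method; all names and numbers are assumptions. The stochastic comparison test used for noisy costs is not reproduced here.

```python
import random


def descend(x, cost, n):
    """Move to the better neighbor until no neighbor improves the cost."""
    while True:
        neighbors = [y for y in (x - 1, x + 1) if 0 <= y < n]
        y = min(neighbors, key=cost)
        if cost(y) >= cost(x):
            return x  # local minimum reached
        x = y


def random_restarts(local_search, cost, sample_initial, n_restarts, rng):
    """Run a local search from several initial points and keep the best."""
    best_u, best_cost = None, float("inf")
    for _ in range(n_restarts):
        u = local_search(sample_initial(rng))
        c = cost(u)
        if c < best_cost:
            best_u, best_cost = u, c  # new "best point seen so far"
    return best_u, best_cost


# Hypothetical cost with local minima at indices 1, 3, and 6.
TOY_COSTS = [5, 3, 4, 1, 2, 6, 0, 7]


def toy_cost(x):
    return TOY_COSTS[x]


if __name__ == "__main__":
    rng = random.Random(0)
    print(random_restarts(lambda x: descend(x, toy_cost, len(TOY_COSTS)),
                          toy_cost,
                          lambda r: r.randrange(len(TOY_COSTS)),
                          n_restarts=20, rng=rng))
```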

The surrogate problem approach was applied for a single mission wave and its performance was compared to three
alternatives as shown in Figure 5. Here, “Greedy” refers to a simple greedy heuristic used also in Wohletz et al. [14]
in which targets are ordered on the basis of their value and missions are assigned from most to least valuable target
without taking into account the number of available assets at the base. Another heuristic, labeled “MMR”, uses
the maximal marginal return in expected mission reward and assigns packages based on this metric, computed using
an analytical model that includes detailed engagements and a limited amount of interdependencies of packages in
engaging threats (therefore, it tends to be conservative). Finally, “Rollout” refers to a rollout algorithm [3] applied to
a single mission wave. Note that performance in these results is measured as the fraction of total attainable reward
with respect to the total target value. In addition to the scenario described above, the results in Figure 5 include an
average over a number of randomly generated scenarios.

Figure 6. Total campaign value comparison over different solution approaches (multiple mission waves)
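A generic marginal-return allocation can be sketched as follows. This is not the MMR heuristic evaluated here, whose analytical engagement model is not given in this paper; it only illustrates the idea of repeatedly adding the single asset that yields the largest increase in expected reward. The function name `mmr_allocate` and the toy reward (independent kills with probability 0.5 per aircraft, expected loss cost of 0.2 per aircraft) are assumptions.

```python
def mmr_allocate(reward, M, K):
    """Greedy allocation by maximal marginal return.

    reward(u) -> expected reward of allocation u[i][q]
    M         -> number of targets
    K         -> list of available assets per type
    """
    Q = len(K)
    u = [[0] * Q for _ in range(M)]
    remaining = list(K)
    current = reward(u)
    while True:
        best_gain, best_move = 0.0, None
        for i in range(M):
            for q in range(Q):
                if remaining[q] == 0:
                    continue
                u[i][q] += 1                   # tentatively add one asset
                gain = reward(u) - current
                u[i][q] -= 1                   # undo the tentative move
                if gain > best_gain:
                    best_gain, best_move = gain, (i, q)
        if best_move is None:
            return u, current                  # no asset adds positive value
        i, q = best_move
        u[i][q] += 1
        remaining[q] -= 1
        current = reward(u)


# Hypothetical reward model (assumption for illustration only).
TOY_V = [10.0, 4.0]


def toy_reward(u):
    kills = sum(v * (1.0 - 0.5 ** row[0]) for v, row in zip(TOY_V, u))
    losses = 0.2 * sum(row[0] for row in u)
    return kills - losses


if __name__ == "__main__":
    print(mmr_allocate(toy_reward, M=2, K=[3]))
```

Unlike a purely value-ordered greedy rule, each step here accounts for the assets still available and for the diminishing return of adding further aircraft to the same target.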

In the case of a campaign with multiple mission waves, the surrogate problem method was applied independently
from  one  wave  to  the  next  and  the  performance  results  are  shown  in  Figure  6  (which  includes  confidence  intervals).  In 
this  case,  “MMR+Surrogate”  refers  to  the  use  of  the  MMR  algorithm  to  determine  an  initial  point  for  the  surrogate 
problem  method,  which  accelerates  its  convergence.  Note  that  this  method  brings  the  campaign  to  an  end  faster 
than  the  other  approaches  shown;  its  lack  of  anticipative  capabilities  shows  in  that  it  does  not  attain  additional  value 
that  the  MMR  heuristic,  for  example,  can  at  the  expense  of  a  longer  campaign.  This  is  also  illustrated  in  Figure 
7:  The  surrogate  problem  method  is  seen  to  destroy  approximately  the  same  number  of  targets  as  the  other  three 
approaches  in  less  time,  but  at  the  expense  of  higher  asset  attrition.  The  “MMR+Surrogate”  controller,  on  the  other 
hand,  can  significantly  reduce  attrition  at  the  expense  of  a  somewhat  longer  campaign. 

5.  CONCLUSIONS  AND  FUTURE  WORK 

The  quest  for  agile  control  in  a  JAO  environment  depends  on  appropriate  battle  space  modeling  and  formulation  of 
a  stochastic  dynamic  optimization  problem  for  mission  planning.  In  this  paper,  we  have  described  the  complexities 
associated  with  these  tasks  and  some  approaches  for  solving  optimal  mission  planning  problems.  These  approaches 
combine approximation methods for solving notoriously hard dynamic programming problems with more recent
advances  in  optimal  resource  allocation  techniques.  They  all  require  estimating  performance-related  quantities  under 
alternative  controllers,  which  is  often  accomplished  using  discrete  event  simulation.  Thus,  part  of  our  ongoing  work 
involves  the  development  of  simulation  tools  and  related  efficient  estimation  methods. 

Our  ultimate  goal  is  to  develop  closed-loop  control  and  optimization  methods  capable  of  taking  advantage  of  battle 
space  data  as  they  become  available  in  real  time,  while  also  anticipating  uncertain  future  events.  Toward  this  goal, 
we  are  exploring  new  approximation  methods  and  optimization  algorithms  that  can  incorporate  simulation-based 
estimates.  It  is  possible,  for  example,  to  exploit  the  structure  of  the  uncertainty  factors  entering  in  this  problem  by 
extracting  key  features  rendering  estimation  more  efficient  with  little  loss  of  accuracy.  At  the  same  time,  we  need 
to  include  additional  features  of  the  JAO  setting  into  our  models,  such  as  the  presence  of  time-critical  targets  or 
“intelligent”  threats. 


[Figure 7 bar charts; categories: Greedy, MMR, MMR+Surrogate, Surrogate]


Figure  7.  Performance  comparison  over  different  solution  approaches  (single  mission  wave) 


ACKNOWLEDGMENTS 

This  work  is  supported  in  part  by  the  Air  Force  Research  Laboratory  under  contract  F30602-99-C-0057  and  by 
AFOSR  under  grant  F49620-01-0056. 


REFERENCES 

1.  E.  Aarts  and  J.  Korst.  Simulated  Annealing  and  Boltzmann  Machines.  Wiley,  New  York,  NY,  1989. 

2. D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Massachusetts, 1995.

3. D. P. Bertsekas and J. N. Tsitsiklis. Rollout algorithms for combinatorial optimization. Journal of Heuristics,
3(3):245-262,  1997. 

4.  C.  G.  Cassandras,  L.  Dai,  and  C.  G.  Panayiotou.  Ordinal  optimization  for  deterministic  and  stochastic  resource 
allocation.  IEEE  Trans.  Automatic  Control,  43(7):881-900,  1998. 

5.  C.  G.  Cassandras  and  S.  Lafortune.  Introduction  to  Discrete  Event  Systems.  Kluwer  Academic  Publishers,  1999. 

6.  C.  G.  Cassandras  and  C.  G.  Panayiotou.  Concurrent  sample  path  analysis  of  discrete  event  systems.  Journal 
of  Discrete  Event  Dynamic  Systems:  Theory  and  Applications,  9:171-195,  1999. 

7.  K.  Gokbayrak  and  C.  G.  Cassandras.  A  generalized  ‘surrogate  problem’  methodology  for  on-line  stochastic 
discrete  optimization.  J.  of  Optimization  Theory  and  Applications.  Submitted  2001. 

8.  K.  Gokbayrak  and  C.  G.  Cassandras.  An  on-line  ‘surrogate  problem’  methodology  for  stochastic  discrete  resource 
allocation  problems.  J.  of  Optimization  Theory  and  Applications,  108(2):349-376,  2001. 

9.  W.  B.  Gong,  Y.  C.  Ho,  and  W.  Zhai.  Stochastic  comparison  algorithm  for  discrete  optimization  with  estimation. 
Proc.  of  31st  IEEE  Conf.  on  Decision  and  Control,  pages  795-800,  1992. 

10. Y. C. Ho, R. S. Sreenivas, and P. Vakili. Ordinal optimization in DEDS. J. of Discrete Event Dynamic Systems:
Theory  and  Applications,  2:61-88,  1992. 

11.  J.H.  Holland.  Adaptation  in  Natural  and  Artificial  Systems.  University  of  Michigan  Press,  Ann  Arbor,  MI,  1975. 

12.  H.  J.  Kushner  and  D.S.  Clark.  Stochastic  Approximation  for  Constrained  and  Unconstrained  Systems.  Springer- 
Verlag,  Berlin,  Germany,  1978. 

13. L. Shi and S. Olafsson. Nested partitions method for global optimization. Operations Research, 48:390-407,
2000.


14.  J.M.  Wohletz,  D.A.  Castanon,  and  M.L.  Curry.  Closed-loop  control  for  joint  air  operations.  In  Proceedings  of 
2001 American Control Conference, 2001. To appear.

15. D. Yan and H. Mukai. Stochastic discrete optimization. SIAM Journal on Control and Optimization, 30:549-612,
1992. 


